By Ryan Kennedy.
One of my favorite things is debugging problems. I don’t know why, but I genuinely enjoy it and I think I’m actually good at it. This is the story of one of my favorite debugging sessions ever.
I joined Yahoo! Mail back in late 2004. After doing mostly Java in my undergrad and then nothing but Java professionally afterwards, I was doing C++ and PHP at Yahoo!. A bunch of the PHP had to wrap underlying backend C++ libraries. C++ and I…were not friends in the least. I was used to the JVM hiding pointers and memory management from me. Nevertheless, after much bellyaching, I managed to get things working.
During internal testing we were getting sporadic complaints of file uploads failing. No HTTP errors from the server…the connection would just close itself. My more experienced coworkers told me this was typically the behavior seen when an Apache process would crash. The evidence would be found in core files on the affected machines. Sure enough, once I’d figured out where these mysterious artifacts could be found (having been a Java programmer for most of my life, core dumps were new to me) I quickly located quite a number of very large core files.
I had to figure out which of the core files (if any) were related to the problem I was investigating, which meant needing to learn enough GDB to load a core dump along with all the necessary symbols to be able to make sense of where things had gone wrong. At this point I had only a basic understanding of GDB (I had an unconventional undergrad experience having blown through a Computer Science degree in 2 years after spending 3 years as a Physics/Chemistry major), but I quickly figured out how to load the core dump and at least get a back trace. None of the back traces had anything to do with file uploads…they were segmentation faults all over the place. I started looking at the code indicated, but nothing looked out of place. I was completely unable to find any code that could be causing a segmentation fault.
About this time one of our frontend engineers caught the upload failure as it happened and called me over. He showed me again and again how the server would drop the connection. I asked for a copy of the attachment and went back to my desk. I sent the attachment to my own local development instance and watched, happily, as my process also crashed. This was the first breakthrough…a reproducible case. I put Apache into single process mode, attached GDB, and ran the request again. GDB caught the segmentation fault and dumped the stack trace. Unfortunately it was in an incredibly bizarre location. I had literally no idea what was going on.
The problem had a certain smell, however. It reminded me of something I’d seen in a previous job. I worked in Java at that job, but we had JNI wrappers for a vendor supplied library. I modified the wrapper once and it blew up in my face in a really non-obvious way (stack traces pointing to bizarre locations). A much more experienced engineer told me it sounded like I was “smashing the stack.” I had an array on the stack and I was writing off the end of it, blowing up bits of the stack along the way.
Determined I was encountering the same issue, I started wondering how on earth one finds memory corruption like this. Yahoo! Mail was an enormous codebase…I couldn’t just go spelunking for the problem. I needed help. During college, my senior project advisor had lent me a copy of Linux Application Development (I’m not sure why I never gave it back). On a whim, I flipped through it until I found a section on memory. In there, it talked about a tool called Electric Fence. Electric Fence replaced the system allocator, erecting barriers on either side of the allocated memory to detect buffer under and over flows.
I excitedly got back on the computer and began looking for it. I found a copy for FreeBSD, plugged it into my Apache module, restarted Apache, connected GDB, sent the doomed upload, and watched it fail exactly the same way: SIGSEGV instead of the expected SIGBUS Electric Fence ought to throw when an overflow occurred. “What the heck?”, I thought. I spent some time looking at the fine print in the documentation and noticed that by default Electric Fence would allocate a full page and set the barrier on the next page. So small overflows wouldn’t trigger Electric Fence. I found the setting (EF_ALIGNMENT) that put the barrier on the very next byte after what was requested in allocation, re-did the setup, and BOOM…SIGBUS. I ran the backtrace and found myself in the portion of code that was constructing the MIME body part, copying in the contents of the attachment provided.
It turned out that the underlying library could be called in different orders to construct a MIME message. Old Yahoo! Mail called it one order and New Yahoo! Mail (the one I was building) called it in another order. The order I was calling it in caused the buffer used to hold the attachment not to be properly initialized. As a result, attachments of a certain type and size (I remember it being nuanced, which is why it didn’t happen all the time) could overflow the buffer into undetermined space. I filed a bug against the team owning the library, updated my code to work around the ordering problem, and re-ran my test successfully.
This was 5 years into what is now a 15 year career and it is still one of the best, if not the best, bugs I’ve ever tracked down and fixed. Mostly I think I liked it because I had to learn so many new things to figure it out. So solving the problem felt like a tremendous accomplishment.
Thanks to Bruce Perens for his wonderful tool, Dr. Emilia Villareal for lending me the book (I owe you a copy of the new edition), and the inspiring Julia Evans for asking me to write this up.