How to debug long-running programs

I recently chased down a crash in a piece of software running on a large test case. The fix, surprisingly, was a one-liner in an area of code that had been touched only four times in the last 8 years. It was a fairly complex piece of code and completely unfamiliar territory for me. There was no help available: the person who wrote it and last worked on it had left the company a few years back, and they left no design document behind. I think this is a pretty common scenario in large products with legacy code.

Fortunately, the bug was deterministic and occurred in a monolithic program that ran sequentially. The program is a binary compiled and linked from C++ code and runs on Linux.

Where was the bug seen?
The crash happened inside a function sitting somewhere in the middle of some spaghetti code, and I had no clue what that function did. Even after putting some effort into shortening the run, it took more than an hour and a half to reproduce the crash. Although it was not a very long-running program (I’ve seen people debug bugs that took days to reach), it was long enough to make it more challenging to fix than other bugs. I did fix it eventually, using a few strategies I have come to adopt over the years. These strategies can also be applied successfully to some of the difficult bugs in unknown code bases.

Strategy No. 1. Stop thinking “I can’t solve this bug”.
This is a slippery slope and should be avoided at all costs. Most bugs can be solved with methodical debugging and perseverance, even when they occur in someone else’s code. Sometimes even non-deterministic bugs can be solved. It may take time and it may require you to learn new things, but it’s not impossible.
If you’re working in a team, chances are you often have to fix bugs in somebody else’s code, and trying to enjoy the process helps. I like to discuss a bug with other people who I know are interested in chatting about technical topics, even if they don’t work on that code directly. Solving difficult bugs boosts my self-confidence, and that in turn boosts other team members’ confidence in me. And if nothing else, I have almost always learned something new by the time I solved a difficult bug.

Strategy No. 2. Throw away all the assumptions.
Don’t start with an assumption about the cause of the bug. It can help if you know the code inside out, but more often than not the starting assumptions about the cause turn out to be wrong. For instance, some of the assumptions about this bug were “Oh, it’s a memory error” or “Oh, no other programs using this piece of code are crashing, so the real bug must be in this particular program”. They were all wrong and they didn’t help. Such assumptions influence your debugging, and you tend to miss other clues that present themselves along the way.

I’m prone to making such assumptions. I’ve learned not to act upon them; instead I follow the set of steps described in the following sections.

Strategy No. 3. Try to reproduce the crash in a shorter running program.
This is helpful if you work on a program that takes a long time on large inputs. It’s often worth the effort to reduce the input size and check whether the bug is still reproducible, because that saves a lot of debugging time. It’s not always easy, or even possible, right at the beginning of debugging, so I keep trying throughout so that I can save as much time as possible.

Strategy No. 4. Use the right tools.
Since the program in this case is written in C++ and runs on Linux, there exists a fairly good debugging ecosystem. gdb from GNU and lldb from the LLVM project are both good, and mastering these tools often pays large dividends. For instance, gdb can be scripted using Python, which is extremely helpful while debugging long-running programs. Other gdb commands like checkpoint are very useful too. Last I checked, reverse debugging in gdb wasn’t good, but I’ve heard good reviews of UndoDB, a commercial reverse debugger. rr from Mozilla is another reverse debugger, and I’m yet to try it.
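As a sketch of what that scripting can look like (the function name, hit count and file names here are made up for illustration), a gdb command file can skip quickly past the uninteresting part of a long run and snapshot the process near the crash:

```
# saved as debug.gdb; run with: gdb -x debug.gdb ./program
break process_record        # hypothetical function near the crash site
ignore 1 99999              # skip the first 99999 hits of breakpoint 1
run large_test_case.dat
checkpoint                  # snapshot the process here; 'restart 1' rewinds to it
continue
```

The checkpoint lets you re-examine the moments before the crash repeatedly without paying for the whole run each time.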

For very long-running programs, checkpointing the program with dmtcp also helps; that way you don’t have to spend a lot of time just waiting for the program to reach the bug. It often takes some effort to integrate dmtcp with a large code base, but it may be worth it.
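A minimal sketch of what that looks like from the command line (the program and file names are made up, and dmtcp needs to be installed separately):

```
# start the program under dmtcp, taking a checkpoint every 600 seconds
dmtcp_launch --interval 600 ./long_program big_input.dat

# after a crash, restart from the last checkpoint instead of from scratch
dmtcp_restart ckpt_long_program_*.dmtcp
```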

Sometimes, enabling existing log messages or adding new ones to better understand the state of the program can be much more helpful than a debugger. A debugger executes the program slowly, whereas an optimized build with log messages turned on runs much faster. Meaningful log messages are a good tool for understanding the behaviour of the program in unknown areas of code.

In two of my most memorable debugging experiences, existing log messages, plus the new ones I added, helped me solve the bugs much faster than the debugger alone would have. And if some of the log messages I add for debugging purposes turn out to be broadly helpful, I try to add them to the code permanently.

Strategy No. 5. Dig deeper to understand the reason of the bug.
In a well-written code base, bugs and their causes should be localized: the misbehaviour caused by an erroneous line or lines of code shouldn’t show up somewhere far from the culprit. But it’s not a perfect world, and it is very easy in C and C++ to write code where a bug is introduced in one place and manifests somewhere far away. The initial debugging may show what is wrong, but it can take more time to figure out why. For example, in one of the bugs I saw a local variable inside a function getting corrupted even though nothing suspicious was happening inside the function. That was the what part of the bug – a corrupt variable. The journey to find the why led me to a static buffer overflow in another function that clobbered the value of the local variable. During debugging I had added a new local variable, placed before the corrupt one, to count something, and that stopped the crash. But this was a hack, not a real solution. The real solution involved finding out why the local variable got corrupted in the first place. Do not try out a bunch of random changes just to make the bug go away.

Strategy No. 6. Understand the area of code where the bug is seen.
It may not always be necessary, but more often than not it is. Spending some time and effort to understand the unfamiliar area of code where the bug shows up pays off. This time I had to understand the functionality of the code, the individual functions in the call stack and the assumptions with which they were written. A few days later there was another bug in the same area of code, and this time I could solve it much faster.

Strategy No. 7. Verify your fix by adding a test case.
I always make it a point to add a test case to the regression suite for the bug I fixed. Sometimes it’s possible to add a simple unit test for the bug, and sometimes it may require a big system test case. Either way, a test case in the regression suite ensures that the same bug will not appear twice.

Strategy No. 8. Write a post-mortem report on the bug.
Or make a blog post out of it. I do it because I want to find out whether I could have debugged the problem more efficiently. I also like to write about anything new I learned.