The elusive memory error

This one involved a crash in the 32-bit optimized build of an executable that ran fine as a 32-bit or 64-bit debug build, and even as a 64-bit optimized build. No crash with the debug build meant no debug symbols to work with! So I started adding log lines of the form “inside this function: all looking good”. This exercise led me to the culprit function. Nothing outrageous was happening inside it. A few more printfs led me to the line where the crash occurred.

void crashy_func()
{
    ...
    ...
    ...
    my_list *pList = NULL;
    initialize_list (&pList);

    my_container->list = pList;

    if (cond)
        call_some_func(my_container);
    else
        call_some_other_func(my_container);

    std::cout << "Size of list " << pList->size ; // And here it crashed
    ...
    ...
    ...
}

Points to be noted here –
1. Valgrind didn’t show any memory error in this area of code.
2. Neither call_some_func() nor call_some_other_func() freed my_container->list. In fact they didn’t make any change at all to my_container->list.
3. At the point of the crash, the values of pList and my_container->list were different. While my_container->list still had the value it was assigned, pList was showing some junk value. Conclusion – pList got corrupted somewhere. But how could that be possible?

Initially, my guess was that the compiler had optimized away the local variable pList. But if that were the cause, why didn’t the 64-bit optimized build fail too? And how could such a trivial bug exist in gcc? So that guess was utterly wrong.

Anyway, I made a temporary fix by replacing the crashing line with –
std::cout << "Size of list " << my_container->list->size ; // And here it stopped crashing.

Only it was not exactly a fix. Some memory location had been corrupted, and I merely avoided reading it by reading another location that held the same value.
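To see why that workaround hides the problem rather than fixing it, here is a minimal sketch. The struct layouts and the clobber() helper are hypothetical stand-ins, not the real code; the corruption is simulated with an ordinary assignment so the example stays well-defined, whereas in the real bug an out-of-bounds write did the damage.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the types in the post.
struct my_list { std::size_t size; };
struct my_container_t { my_list *list; };

// Simulates the corruption in a well-defined way: overwrite the pointer
// stored at *slot with some other value.
void clobber(my_list **slot, my_list *junk) { *slot = junk; }

// Returns true if the container's copy of the pointer is still good
// after the local copy has been overwritten.
bool container_copy_survives()
{
    my_list real = {3};
    my_list junk = {0};

    my_list *pList = &real;              // local pointer variable
    my_container_t container = {pList};  // second copy of the same address

    clobber(&pList, &junk);              // only the local copy is damaged

    return container.list == &real
        && container.list->size == 3
        && pList != container.list;
}
```

The two pointers live at different addresses, so damaging one leaves the other intact. The write that did the damage is still there, which is why reading through the container only hides the symptom.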

So how did I find out the problem?

I ran an optimized build that had not been stripped of symbols under gdb. gdb has a handy command, disassemble, which I used to disassemble crashy_func().

The native code displayed by the disassemble command showed that the compiler had inlined call_some_func(). From it I could work out the memory location at which the value of pList was stored. You can locate it in the native code by looking for nearby function calls that the compiler didn’t inline, and then matching the native code against the actual C/C++ code. There is no formula for this; comparing the two and figuring out which instructions correspond to which source lines is the only way.

But this didn’t tell me how the variable pList got corrupted. What I needed was a watch on the memory location that stored the pointer pList. I ran the executable in gdb, and around the line initialize_list (&pList); I started printing the native code.

The command I used –

x /10i $pc <-- This prints the next 10 instructions.

To clarify things,
pList is a local pointer variable; the value it holds is the address at which the list is stored.
my_container->list is the field list of the struct that my_container points to; after the assignment it holds that same address.

For the line "my_container->list = pList;", there would be instructions that load the value of pList from its memory location into a register and then store that register into the list field. The operand of that load gives the memory address at which the pointer pList itself is stored, and that is the address to watch.
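The distinction between the value a pointer holds and the place where the pointer itself lives is easy to mix up, so here is a small sketch of it. The name initialize_list_sketch and its extra storage parameter are assumptions made to keep the example self-contained; the real initialize_list takes only &pList.

```cpp
#include <cassert>

// Hypothetical stand-in for the list type in the post.
struct my_list { int size; };

// Hypothetical initializer: receives the address of the caller's pointer
// variable and writes the list's address into it.
void initialize_list_sketch(my_list **slot, my_list *storage)
{
    storage->size = 0;
    *slot = storage;
}

// Returns true if the address of the pointer variable (what to watch)
// differs from the address stored in it (where the list lives).
bool slot_differs_from_target()
{
    static my_list storage;
    my_list *pList = nullptr;
    initialize_list_sketch(&pList, &storage);

    const void *slot = &pList;   // where pList itself is stored
    const void *target = pList;  // what pList holds: the list's address

    return target == &storage && slot != target;
}
```

A watch on the target address would fire when the list's contents change. The corruption in this story hit the slot, so that is the address the watch had to be set on.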

Once I got the memory address that stored the pointer pList, a watch on that address revealed that a static buffer overflow had caused the corruption. The culprit was an sprintf() call.
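The post doesn't show the offending sprintf() call, so the following is only a sketch of the usual remedy. It assumes a fixed-size destination buffer, and format_label with its "item-%d" format is a hypothetical helper. snprintf never writes more than the given capacity, and its return value tells you whether the output was truncated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Hypothetical helper: formats "item-<id>" into buf, which holds cap bytes.
// Returns true only if the whole string, including the terminating NUL, fit.
bool format_label(char *buf, std::size_t cap, int id)
{
    // snprintf writes at most cap bytes (NUL included) and returns the
    // length the full output would have had, or a negative value on error.
    int needed = std::snprintf(buf, cap, "item-%d", id);
    return needed >= 0 && static_cast<std::size_t>(needed) < cap;
}
```

With an 8-byte buffer, format_label(buf, sizeof buf, 42) fits exactly, while an id like 123456 is cut off at the buffer's end and reported as a failure instead of spilling into whatever sits next to the buffer.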

The bottom line is that if you get a spooky crash and do not have a debug build, or cannot reproduce the crash in a debug build, do not panic. What the whole experience taught me was – make use of gdb as much as possible. Here's a list of a few handy gdb commands.