Debugging Memory Problems using Dervish

Robert Lupton, October 1997

This document could use a tutorial introduction -- offers of help, comments, or (gasp) HTML to me, please.

Some of these tools are not yet available in any cut version of Dervish --- but they are all checked into cvs; they should all be in version 6.8.

The hardest class of bug to find in C programs is probably memory problems, either leaks or corruption. Fortunately, dervish has a number of tools available to help you in your task.

In the following, the term heap will sometimes appear; it's the name of all the memory under shMalloc's control; that is, the memory that has been handed out by shMalloc (and maybe shFreed again). You should also know that every piece of memory that dervish hands to the user, whether via shMalloc or shRealloc, has a unique serial number (the same address may be used several times).

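Dervish will always issue the warnings described under Dervish's Default Warnings.

Furthermore, you can always ask dervish to list the blocks that were allocated but never freed (see Dervish's Memory Leak Tools) and to check the heap for corruption (see Dervish's Memory Corruption Tools). If you define CHECK_LEAKS to the C pre-processor, you will also be given file and line information for shMalloc. If you know how to ask (and if your application has a few lines of code compiled in), you can also get callbacks for allocating or freeing specified blocks. Additionally, you can recover from dervish running out of memory, and set breakpoints for particular memory blocks.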

How Dervish can Help

Dervish's Default Warnings

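Whenever you hand dervish a block that it does not recognise as valid (for example, in a call to shFree), or free a block that has already been freed, Dervish will tell you, and (by default) call shFatal. (In fact, it will also do so when using an internal counter on a bad block, but don't worry about that.)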

The first class of problems is either a programming error (e.g. calling shFree on a variable that you declared as an array), or else the block really is valid but the heap is corrupted; if the latter, consult the section on heap corruption.

You can diagnose the second class of problems using the tools discussed under memory callbacks.

Dervish's Memory Leak Tools

There is a tcl (dervish actually) procedure called memBlocksPrintRange that prints out all the blocks of memory that have been allocated but not freed. Because you have almost certainly got some blocks that are not going to be freed (e.g. stuff allocated in startup code), it actually only tells you about blocks in a specified range of memory serial numbers. The photo group uses the following proc as a convenient wrapper:

	proc mortal {args} {
	   global startMem

	   if {$args == "set"} {
	      set startMem [memSerialNumber]; return
	   }

	   if {![info exists startMem]} {
	      set startMem 0
	   }
	   memBlocksPrintRange [expr $startMem+1] [memSerialNumber] $args
	}
Used as:
	# allocate lots of stuff for startup
	mortal set
	# do lots of work, cleaning up carefully
	if {[mortal] != ""} {
	   error "Found a memory leak"
	}
The output looks like:
0x10d7f1c0 {18 2 64 region.c 875}
0x10d58a50 {17 44 64 region.c 349}
0x10d33260 {16 3 64 region.c 340}
0x10d18fb0 {15 4 64 region.c 934}
0x10d674d0 {14 72 128 region.c 331 h0 REGION}
Here the first column (0x10d7f1c0) is the address that should have been freed, the second (18) is the serial number, and the next two (2 64) are the number of bytes allocated and the size of the internal block that dervish used to satisfy the request for memory. The next two fields (region.c 875) are the file and line number where shMalloc was called.

In the case of block 14, there are two additional fields (h0 REGION), which tell you that the block, of type REGION, is bound to handle h0.
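To make the idea of listing live blocks by serial number concrete, here is a toy model in plain C. It is not dervish's implementation: TOY_BLOCK, toy_malloc, toy_free and toy_print_range are all hypothetical names, and the header layout is an assumption made only for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical block header; NOT dervish's real SH_MEMORY */
typedef struct toy_block {
   unsigned long serial;		/* unique, never reused */
   size_t nbytes;			/* bytes requested by the caller */
   const char *file;			/* where the allocation was made */
   int line;
   struct toy_block *next;		/* list of live blocks */
} TOY_BLOCK;

static TOY_BLOCK *toy_live = NULL;	/* blocks allocated but not freed */
static unsigned long toy_serial = 0;	/* last serial number handed out */

#define toy_malloc(n) toy_malloc_at((n), __FILE__, __LINE__)

static void *
toy_malloc_at(size_t n, const char *file, int line)
{
   TOY_BLOCK *b = malloc(sizeof(TOY_BLOCK) + n);

   if(b == NULL) return NULL;
   b->serial = ++toy_serial; b->nbytes = n;
   b->file = file; b->line = line;
   b->next = toy_live; toy_live = b;
   return b + 1;			/* user memory follows the header */
}

static void
toy_free(void *ptr)
{
   TOY_BLOCK *b = (TOY_BLOCK *)ptr - 1, **pp;

   for(pp = &toy_live; *pp != NULL; pp = &(*pp)->next) {
      if(*pp == b) { *pp = b->next; free(b); return; }
   }
   fprintf(stderr, "toy_free: block is not live\n");
   abort();
}

/* Print live blocks with lo <= serial <= hi; return how many were found */
static int
toy_print_range(unsigned long lo, unsigned long hi)
{
   int n = 0;
   TOY_BLOCK *b;

   for(b = toy_live; b != NULL; b = b->next) {
      if(b->serial >= lo && b->serial <= hi) {
         printf("%p {%lu %lu %s %d}\n", (void *)(b + 1), b->serial,
                (unsigned long)b->nbytes, b->file, b->line);
         n++;
      }
   }
   return n;
}
```

Recording toy_serial after the startup allocations, and later asking for the range from that value plus one up to the current toy_serial, plays the same role as "mortal set" followed by "mortal" above.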

Dervish's Memory Corruption Tools

There is a dervish proc memCheck that checks the entire heap for corruption; with the option -abort it'll call shFatal if it finds any. It is a good idea to call this at about the time that your tcl framework calls memBlocksPrintRange to check for memory leaks. It is very helpful to track down memory corruption as soon as it's introduced into your program, even before it starts leading to symptoms.
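One common way for a heap checker to detect corruption is to surround each block with sentinel bytes and verify them later. The sketch below shows that technique in plain C; I am not claiming that this is how memCheck works internally, and every name in it (TOY_HEADER, toy_malloc, toy_mem_check) is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_GUARD_BYTE 0xA5
#define TOY_GUARD_LEN 8
#define TOY_MAX_BLOCKS 64

typedef struct {
   size_t nbytes;			/* bytes requested by the caller */
   unsigned char guard[TOY_GUARD_LEN];	/* sentinel ahead of the user memory */
} TOY_HEADER;

static TOY_HEADER *toy_blocks[TOY_MAX_BLOCKS];	/* every block handed out */
static int toy_nblock = 0;

/* Return 1 if all the sentinel bytes are intact, else 0 */
static int
toy_guard_ok(const unsigned char *guard)
{
   int i;

   for(i = 0; i < TOY_GUARD_LEN; i++) {
      if(guard[i] != TOY_GUARD_BYTE) return 0;
   }
   return 1;
}

/* Allocate n bytes with a sentinel before and after (the toy never frees) */
static void *
toy_malloc(size_t n)
{
   TOY_HEADER *h;

   if(toy_nblock == TOY_MAX_BLOCKS) return NULL;
   h = malloc(sizeof(TOY_HEADER) + n + TOY_GUARD_LEN);
   if(h == NULL) return NULL;

   h->nbytes = n;
   memset(h->guard, TOY_GUARD_BYTE, TOY_GUARD_LEN);
   memset((unsigned char *)(h + 1) + n, TOY_GUARD_BYTE, TOY_GUARD_LEN);
   toy_blocks[toy_nblock++] = h;
   return h + 1;
}

/* Check every block; return the number whose sentinels were overwritten */
static int
toy_mem_check(void)
{
   int i, nbad = 0;

   for(i = 0; i < toy_nblock; i++) {
      const TOY_HEADER *h = toy_blocks[i];
      const unsigned char *tail = (const unsigned char *)(h + 1) + h->nbytes;

      if(!toy_guard_ok(h->guard) || !toy_guard_ok(tail)) nbad++;
   }
   return nbad;
}
```

The toy shows why it pays to check often: a one-byte overrun is caught by the very next call to toy_mem_check, long before it produces any visible symptom.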

File and Line Information for shMalloc

If you compile your program with -DCHECK_LEAKS, every call to shMalloc, shRealloc, and shFree records the file and line number where the call is made. This is used in error messages when dervish detects problems, as well as in the output from memBlocksPrintRange.

Callbacks for allocating or freeing specified blocks

It is often helpful to be able to get control of a program when a particular memory block is allocated or freed. The commonest use of this capability is to catch memory leaks (or twice-freed pointers), but I find myself using it for other purposes too (see the section on breakpoints).

Let's first consider finding a memory leak. Suppose that the output of memBlocksPrintRange indicates that the block with serial number 50258 is never freed. Add a function like this to your main program:

	static void
	malloc_trace(unsigned long thresh, const SH_MEMORY *mem)
	{
	   printf("Allocated block %lu\n", thresh);
	}
add a line
	shMemSerialCB(50258,malloc_trace);
recompile, and when block 50258 is allocated, a message is printed.

This may not seem very helpful, but when used in conjunction with a debugger things look up. Set a breakpoint in malloc_trace, and the program will stop when your block is allocated, which is usually enough to diagnose the problem.

Once you've decided to use a debugger, the whole procedure can be streamlined. Rather than adding the line

	shMemSerialCB(50258,malloc_trace);
only when block 50258 catches your fancy, leave the line
	shMemSerialCB(0,malloc_trace);
in permanently. Then use the debugger to set the variable shMalloc::g_Serial_threshold to 50258, and proceed as before (that's what gdb likes to call it; with e.g. dbx your mileage may vary).
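If it helps to see why changing a threshold variable from the debugger is enough, here is a toy model of a serial-number callback in plain C. It is only a sketch of the general mechanism: toy_serial_cb, toy_next_serial and toy_threshold are hypothetical stand-ins, and the assumption that serial numbers start at 1 (so that a threshold of 0 never fires) belongs to the toy, not necessarily to dervish.

```c
#include <assert.h>
#include <stdio.h>

typedef void (*TOY_MEM_FUNC)(unsigned long thresh);

static unsigned long toy_serial = 0;	/* last serial number handed out */
static unsigned long toy_threshold = 0;	/* serial that fires the callback */
static TOY_MEM_FUNC toy_callback = NULL;
static unsigned long toy_last_hit = 0;	/* set by toy_trace, for inspection */

/* Register func to be called when block number thresh is allocated */
static void
toy_serial_cb(unsigned long thresh, TOY_MEM_FUNC func)
{
   assert(func != NULL);
   toy_threshold = thresh;
   toy_callback = func;
}

/* Called once per allocation; fires the callback at the threshold */
static unsigned long
toy_next_serial(void)
{
   unsigned long serial = ++toy_serial;

   if(toy_callback != NULL && serial == toy_threshold) {
      toy_callback(serial);
   }
   return serial;
}

/* The place to set a breakpoint */
static void
toy_trace(unsigned long thresh)
{
   toy_last_hit = thresh;
   printf("Allocated block %lu\n", thresh);
}
```

In this model, registering toy_trace with a threshold of 0 costs nothing at run time, and assigning to toy_threshold later (for instance from a debugger) arms it for whichever block you are interested in.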

If your problem is a doubly-freed pointer, you need to define malloc_free_trace, call

	shMemSerialCB(0,malloc_free_trace);
set a breakpoint in malloc_free_trace, and set shMalloc::g_Serial_free_threshold.

Checking for Heap Corruption

There's a C API p_shMemCheck to check the heap. I usually run it from a memory callback like that described in the previous section:
/*
 * This callback can be used to check the heap for corruption at any desired
 * granularity (set by the variable frequency)
 */
static void
malloc_check(unsigned long thresh, const SH_MEMORY *mem)
{
   static int abort_on_error = 1;	/* abort on first error? */
   static int check_allocated = 1;	/* check allocated blocks? */
   static int check_free = 1;		/* check free blocks? */
   static int frequency = 10;		/* frequency of checks */

   shAssert(mem != NULL);		/* use it for something */

   if(frequency > 0) {
      p_shMemCheck(check_allocated, check_free, abort_on_error);

      shMemSerialCB(thresh + frequency, malloc_check);
   }
}
Followed by a call to
	shMemSerialCB(0,malloc_check);
and setting the variable shMalloc::g_Serial_threshold to whatever value you want to start checking the heap at (set it to 1 to start at the beginning of your program). Using malloc_check will slow things down, so I usually increase the starting threshold as I localise the problem.
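The trick in malloc_check is that the callback re-registers itself, so a single registration turns into a check every frequency allocations. The toy below isolates just that re-arming arithmetic; the callback machinery and all the toy_ names are hypothetical, and the heap check itself is replaced by a counter.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*TOY_MEM_FUNC)(unsigned long thresh);

static unsigned long toy_serial = 0;	/* last serial number handed out */
static unsigned long toy_threshold = 0;	/* serial that fires the callback */
static TOY_MEM_FUNC toy_callback = NULL;
static int toy_frequency = 10;		/* allocations between checks */
static int toy_nchecks = 0;		/* how many checks have run */

static void
toy_serial_cb(unsigned long thresh, TOY_MEM_FUNC func)
{
   toy_threshold = thresh;
   toy_callback = func;
}

/* Stands in for one allocation */
static void
toy_allocate(void)
{
   unsigned long serial = ++toy_serial;

   if(toy_callback != NULL && serial == toy_threshold) {
      toy_callback(serial);
   }
}

/* Periodic check: do the work, then re-arm toy_frequency blocks later */
static void
toy_check(unsigned long thresh)
{
   if(toy_frequency > 0) {
      toy_nchecks++;			/* a real callback checks the heap here */
      toy_serial_cb(thresh + toy_frequency, toy_check);
   }
}
```

Starting at serial number 1 with a frequency of 10, the check runs at blocks 1, 11, 21, and so on. Setting the frequency to 0 stops the chain, because the callback is no longer re-armed.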

Recovering from Dervish Running out of Memory

There are also callbacks for dealing with fatal conditions, namely running out of memory, and detecting a problem in the heap. They are respectively
void *shMemEmpty(size_t n)
A function expecting a single argument. It must either allocate n bytes and return them, or not return at all. It's called when dervish has failed to allocate the desired memory, so simply calling shMalloc (or malloc) is unlikely to work; you'll have to free something first. Set by shMemEmptyCB.
void shMemInconsistency(unsigned long thresh, const SH_MEMORY *mem)
Called when a problem in the heap is detected; mem is the offending block, and thresh is the current value of p_Serial_threshold. You needn't do anything in your callback function; simply returning will probably not lead to any trouble, but you should fix the underlying problem immediately. Set by shMemInconsistencyCB.
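One way to honour the "allocate n bytes or don't return" contract of an out-of-memory callback is to keep some memory in reserve and release it when the callback fires. The following is a minimal sketch of that idea using plain malloc and free; toy_reserve_init and toy_mem_empty are hypothetical names, and a real handler would be registered with shMemEmptyCB and would free whatever your application can best spare.

```c
#include <assert.h>
#include <stdlib.h>

static void *toy_reserve = NULL;	/* memory kept back for emergencies */

/* Set aside n bytes at startup, while memory is still plentiful */
static void
toy_reserve_init(size_t n)
{
   toy_reserve = malloc(n);
}

/*
 * Same shape as the out-of-memory callback described above:
 * either return n bytes, or don't return at all
 */
static void *
toy_mem_empty(size_t n)
{
   if(toy_reserve != NULL) {
      void *ptr;

      free(toy_reserve);		/* free something first ... */
      toy_reserve = NULL;
      ptr = malloc(n);			/* ... then try again */
      if(ptr != NULL) return ptr;
   }
   abort();				/* nothing left to give up */
}
```

The reserve can only be spent once, so a handler like this buys you time to shut down cleanly rather than letting you carry on indefinitely.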

Setting Breakpoints for Particular Memory Blocks

Many objects in photo (e.g. OBJCs, OBJECT1s, and STAR1s) have their own unique ID numbers that are very handy for following them around using the debugger; for example, if I want to know why the OBJECT1_BLENDED flag is set in a particular OBJECT1, I can set breakpoints such as
	b file.c:123 if obj1->id == 123
Some data types, however, have no such luxury, but all is not lost as you can use their memory serial number; you can find this by saying (in gdb)
	p ((SH_MEMORY*)obj1 - 1)->serial_number
after which the preceding breakpoint could have been set as
	b file.c:123 if ((SH_MEMORY*)obj1 - 1)->serial_number == 12695

If I wanted to watch when that object was created, I could have registered a callback for memory block 12695.