C++/msvc6 application crashes due to heap corruption, any hints?

Question

About the application

It runs on Windows XP Professional SP2.
It's built with Microsoft Visual C++ 6.0 with Service Pack 6.
It's MFC based.
It uses several external dlls (e.g. Xerces, ZLib or ACE).
It has high performance requirements.
It does a lot of network and hard disk I/O, but it's also cpu intensive.
It has an exception handling mechanism which generates a minidump when an unhandled exception occurs.
UPDATE: It is a highly multithreaded application and we use mutexes to protect concurrent access (of course, we might be missing a lock somewhere...)

Facts about the crash

It only happens on multiprocessor/multicore machines and under heavy load.
It happens at random (neither we nor our client have found a pattern yet) after some hours of running.
We cannot reproduce the crash in our testing lab. It only happens on some production systems (but always on multicore machines).
It always ends up crashing at the same point, although the complete stack is not always the same. Let me add the stack of the crashing thread (obtained using WinDbg, sorry we don't have symbols)

Exception code: c0000005 ACCESS_VIOLATION
Address        : 006a85b9
Access Type    : write
Access Address : 2e020fff
Fault address:  006a85b9 01:002a75b9 C:\MyDir\MyApplication.exe

ChildEBP RetAddr  Args to Child
WARNING: Stack unwind information not available. Following frames may be wrong.
030af6c8 7c9206eb 77bfc3c9 01a80000 00224bc3 MyApplication+0x2a85b9
030af960 7c91e9c0 7c92901b 00000ab4 00000000 ntdll!RtlAllocateHeap+0xeac (FPO: [Non-Fpo])
030af98c 7c9205c8 00000001 00000000 00000000 ntdll!ZwWaitForSingleObject+0xc (FPO: [3,0,0])
030af9c0 7c920551 01a80898 7c92056d 313adfb0 ntdll!RtlpFreeToHeapLookaside+0x22 (FPO: [2,0,4])
030afa8c 4ba3ae96 000307da 00130005 00040012 ntdll!RtlFreeHeap+0x1e9 (FPO: [Non-Fpo])
030afacc 77bfc2e3 0214e384 3087c8d8 02151030 0x4ba3ae96
030afb00 7c91e306 7c80bfc1 00000948 00000001 msvcrt!free+0xc8 (FPO: [Non-Fpo])
030afb20 0042965b 030afcc0 0214d780 02151218 ntdll!ZwReleaseSemaphore+0xc (FPO: [3,0,0])
030afb7c 7c9206eb 02e6c471 02ea0000 00000008 MyApplication+0x2965b
030afe60 7c9205c8 02151248 030aff38 7c920551 ntdll!RtlAllocateHeap+0xeac (FPO: [Non-Fpo])
030afe74 7c92056d 0210bfb8 02151250 02151250 ntdll!RtlpFreeToHeapLookaside+0x22 (FPO: [2,0,4])
030aff38 77bfc2de 01a80000 00000000 77bfc2e3 ntdll!RtlFreeHeap+0x647 (FPO: [Non-Fpo])
7c92056d c5ffffff ce7c94be ff7c94be 00ffffff msvcrt!free+0xc3 (FPO: [Non-Fpo])
7c920575 ff7c94be 00ffffff 12000000 907c94be 0xc5ffffff
7c920579 00ffffff 12000000 907c94be 90909090 0xff7c94be
*** WARNING: Unable to verify checksum for xerces-c_2_7.dll
*** ERROR: Symbol file could not be found.  Defaulted to export symbols for xerces-c_2_7.dll - 
7c92057d 12000000 907c94be 90909090 8b55ff8b MyApplication+0xbfffff
7c920581 907c94be 90909090 8b55ff8b 08458bec xerces_c_2_7
7c920585 90909090 8b55ff8b 08458bec 04408b66 0x907c94be
7c920589 8b55ff8b 08458bec 04408b66 0004c25d 0x90909090
7c92058d 08458bec 04408b66 0004c25d 90909090 0x8b55ff8b

The address MyApplication+0x2a85b9 corresponds to a call to erase() of a std::list.

What I have tried so far

Reviewing all the code related to the point where the crash ends up happening.
Enabling pageheap in our testing lab, though nothing useful has been found so far.
We replaced the std::list with a C array, and now it crashes in another part of the code (it is related code, but not the code where the old list resided). Coincidentally, it now crashes in another erase, this time of a std::multiset. Let me copy the stack contained in the dump:

ntdll.dll!_RtlpCoalesceFreeBlocks@16()  + 0x124e bytes  
ntdll.dll!_RtlFreeHeap@12()  + 0x91f bytes  
msvcrt.dll!_free()  + 0xc3 bytes    
MyApplication.exe!006a4fda()
[Frames below may be incorrect and/or missing, no symbols loaded for MyApplication.exe] 
MyApplication.exe!0069f305()
ntdll.dll!_NtFreeVirtualMemory@16()  + 0xc bytes    
ntdll.dll!_RtlpSecMemFreeVirtualMemory@16()  + 0x1b bytes   
ntdll.dll!_ZwWaitForSingleObject@12()  + 0xc bytes  
ntdll.dll!_RtlpFreeToHeapLookaside@8()  + 0x26 bytes    
ntdll.dll!_RtlFreeHeap@12()  + 0x114 bytes  
msvcrt.dll!_free()  + 0xc3 bytes    
c5ffffff()

(12-Apr-2010) I've tried to enable heap free checking (using gflags) but it slows down the application a lot...

Possible solutions (that I'm aware of) which cannot be applied

"Migrate the application to a newer compiler": We are working on this but it's not a solution at the moment.
"Enable pageheap (normal or full)": We can't enable pageheap on production machines as this affects performance heavily.

I think that's all I remember now, if I have forgotten something I'll add it asap. If you can give me some hint or propose some possible solution, don't hesitate to answer!

@David Alfonso: which STL implementation are you using? If you are using the one that comes with VC6 and accessing STL data structures across threads, it will crash. — Naveen
Yes, we are using the VC6 implementation, but we are using mutexes to protect STL container access. — davidag
@David Alfonso: Maybe there is a bug in the locking implementation? To confirm, you could compile your code with a thread-safe implementation such as STLPort and see whether it makes any difference. — Naveen
Have you investigated Electric Fence? It can help identify memory overruns using hardware protection... but it doesn't play too well with MFC's macro redefinitions of 'new', so it's a job of work to get integrated. — stusmith
@Naveen I think it's not easy to include STLPort, as we depend on many libraries that use the normal STL... I remember I tried to do it some time ago... it would undoubtedly be a very interesting test. — davidag

Michael Burr · Accepted Answer · 2010-04-07T15:31:54

You can try peppering your code with calls to the debug heap checking routines to see if you can locate the corruption closer to the source (you're using the debug CRT to track down this problem, right?):

http://msdn.microsoft.com/en-us/library/aa271695(VS.60).aspx
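
A minimal sketch of what such checkpoints might look like, assuming a debug build linked against the debug CRT. `_CrtCheckMemory` and `_CrtSetDbgFlag` are the debug heap routines declared in `<crtdbg.h>`; `heap_ok`, `checkpoint`, `enable_check_on_every_alloc` and `remove_value` are hypothetical names used only for illustration, and the list erase merely stands in for the code near the crash. Outside an MSVC debug build the checks compile down to no-ops.

```cpp
#include <stdio.h>
#include <stddef.h>
#include <list>

#if defined(_MSC_VER) && defined(_DEBUG)
#include <crtdbg.h>
#endif

// Hypothetical helper: true if the debug heap passes validation right now.
inline bool heap_ok() {
#if defined(_MSC_VER) && defined(_DEBUG)
    return _CrtCheckMemory() != 0;   // walks and validates every heap block
#else
    return true;                     // no debug CRT available: nothing to check
#endif
}

// Hypothetical helper: validate the heap and report where the check failed.
bool checkpoint(const char* where) {
    if (heap_ok()) {
        return true;
    }
    fprintf(stderr, "heap corruption detected at: %s\n", where);
    return false;
}

// Optional: ask the debug CRT to validate the heap on every allocation
// and deallocation. Very slow, so only suitable for a lab run.
void enable_check_on_every_alloc() {
#if defined(_MSC_VER) && defined(_DEBUG)
    int flags = _CrtSetDbgFlag(_CRTDBG_REPORT_FLAG);  // read current flags
    _CrtSetDbgFlag(flags | _CRTDBG_CHECK_ALWAYS_DF);
#endif
}

// Hypothetical stand-in for the suspect code: erase matching elements,
// with a checkpoint on each side of the operation.
size_t remove_value(std::list<int>& items, int value) {
    checkpoint("remove_value: entry");
    size_t removed = 0;
    for (std::list<int>::iterator it = items.begin(); it != items.end(); ) {
        if (*it == value) {
            it = items.erase(it);    // erase returns the next valid iterator
            ++removed;
        } else {
            ++it;
        }
    }
    checkpoint("remove_value: exit");
    return removed;
}
```

If a checkpoint passes on entry to a function and fails on exit, the corruption happened somewhere in between (or in another thread running at the same time), which narrows the search. Note that the frames in the question go through `msvcrt!free`, the release CRT, so these routines would only have an effect in a debug build.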
