
We are trying to troubleshoot a nasty problem on a production server, where the server starts misbehaving after running for a while.

Diagnostics have led us to believe there may be a bug in a DLL that is used by one of the processes running on this server that is resulting in a global atom leak. The assumed vector is a process that is calling RegisterClass without a corresponding UnregisterClass (and the class name is using a random number as part of the name, so it's a different class name each time the process starts).

This article provided some information: https://blogs.msdn.microsoft.com/ntdebugging/2012/01/31/identifying-global-atom-table-leaks/

But we are reluctant to attempt kernel-mode debugging on a production server, so we have tried installing WinDbg and using the !gatom command to list atoms for a given session.

I use windbg to attach to a process in one of the sessions (these processes are running as Windows Services if that matters), then invoke the !gatom command. The returned atom list doesn't have any window classes in it.

Then I read this: https://blogs.msdn.microsoft.com/oldnewthing/20150429-00/?p=44984

and it sounds like there is a separate atom table for window classes, with no way to query it. I was hoping that we'd be able to actually see how many window class atoms have been registered, and see if that list gets bigger over time, indicating a leak.
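To make the "does the list get bigger over time" check concrete, here is a rough sketch of the comparison I have in mind. It assumes we can somehow obtain two lists of atom or class names taken some time apart; how to obtain them is exactly the open question. The snapshot contents and the helper names (name_family, growth_by_family) are made up for illustration. Digits are collapsed so that names differing only by a random number count as one family.

```python
import re
from collections import Counter

def name_family(name):
    """Collapse digit runs so 'VendorSession_1041' and 'VendorSession_7730' share a key."""
    return re.sub(r"\d+", "#", name)

def growth_by_family(before, after):
    """Count names present in `after` but not in `before`, grouped by family."""
    added = set(after) - set(before)
    return Counter(name_family(n) for n in added)

# Hypothetical snapshots taken an hour apart (all names are invented).
snapshot_1 = ["StaticVendorClass", "VendorSession_1041"]
snapshot_2 = ["StaticVendorClass", "VendorSession_1041",
              "VendorSession_7730", "VendorSession_2219"]

growth = growth_by_family(snapshot_1, snapshot_2)
print(growth.most_common())
```

A family whose count keeps rising from one snapshot to the next would be the leak candidate.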

The documentation on !gatom is sparse, and I'm hoping I can get some expert confirmation or recommendations on how to proceed.

Does anyone have any ideas on how we can get at the list of registered window classes on a production server?


More detail about what happens when the server starts to misbehave:

We run many instances (>50) of the same application as separately registered services running from isolated executables and DLLs - so each of those 50 instances has its own private executables and DLLs.

During their normal run, the processes unload and reload a DLL (about every hour). The DLL uses a window class whose registered name includes a "session handle", and that session handle is unique each time the DLL is loaded. So every hour there is an additional window class registration, made by the DLL (our service stays running).

After some period of time, the system will get into a state where further attempts to load the DLL in question fail. This may happen for one of the services, then gradually over time, other services will start to have the same problem.

When this happens, restarting the service does not fix the problem. The only way that we've found to get things running properly again is to reboot the server.

We are monitoring memory commit load, and we are well within the virtual memory of the server. We are even within the physical memory size.

I just did a code review with the vendor of the DLL, and it looks like they are not actually calling RegisterClass with a session-specific name from the DLL itself (they only make one RegisterClass call from the DLL, and it uses a static string - not a different class name for each session). The DLL launches an EXE, and that EXE is the one that registers the session-specific class name. Their EXE does call UnregisterClass (and even if it didn't, the EXE is terminated when we unload their DLL, so it seems that this may not be what is going on).

I am now out of bullets on this one. The behavior seems like some sort of resource leak or pool exhaustion. The next time this happens, I will try connecting to the failing process with windbg and see what the application atom pool looks like - but I'm not hopeful that is going to shed any light.


Update: The excellent AtomTableMonitor tool has narrowed the problem down to rogue RegisterWindowMessage calls. I'm going to ask a more specific question focused on this exact issue: Diagnosing RegisterWindowsMessage leak
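For a sense of scale, here is a back-of-the-envelope estimate of how long such a leak would take to fill a table. RegisterWindowMessage returns identifiers in the range 0xC000 through 0xFFFF, and there is no call to unregister a message name. Everything else below is an assumption: that each of the 50 services leaks one new name per hour (by analogy with the hourly DLL reload), that the table starts empty, and that nothing else consumes entries. Treat the result as an upper bound, not a measurement.

```python
# Identifiers for registered names fall in 0xC000..0xFFFF, so at most this many fit.
MAX_STRING_ATOMS = 0xFFFF - 0xC000 + 1   # 16384

def hours_until_full(services, new_names_per_service_per_hour, already_used=0):
    """Hours until the table is full at a constant leak rate (assumed, not measured)."""
    free = MAX_STRING_ATOMS - already_used
    leak_rate = services * new_names_per_service_per_hour
    return free / leak_rate

# Assumption: 50 services, each leaking one new name per hour.
hours = hours_until_full(services=50, new_names_per_service_per_hour=1)
print(f"{hours:.0f} hours (~{hours / 24:.1f} days)")
```

Under those assumptions the table would fill in roughly two weeks, which would be consistent with a server that only misbehaves after running for a while and recovers only on reboot.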

"it's a different class name each time the process starts" - Aren't window class atoms stored in per-process atom tables? If that is the case, then there is no issue with using a different class name for each run of the process. - IInspectable
@IInspectable The RegisterClass documentation states: "All window classes that an application registers are unregistered when it terminates." and then: "No window classes registered by a DLL are unregistered when the DLL is unloaded. A DLL must explicitly unregister its classes when it is unloaded." - ssbssa
@ssbssa: So all window classes are unregistered when a process shuts down, irrespective of the module that registered them. This appears to be a strong indication that window classes are stored in per-process atom tables. - IInspectable
Have you checked this standalone monitor: thundaxsoftware.blogspot.in/2012/02/… It appears it can also scan the atoms of services that run in a different session. I haven't tried it, so test with caution. As for kernel debugging, you can run most of the commands listed in the ntdebugging blog using Sysinternals LiveKd (it doesn't require a /debug setting in bcdedit). - blabb
@IInspectable Window classes are stored in a system-wide table. (For example, you can call FindWindow with a class atom to find a window in another process.) It's just that the system cleans up the atoms created by a process when the process exits. - Raymond Chen

1 Answer


You may try using this standalone global atom monitor.
The application appears to be able to monitor atoms in services
that run in a different session.

By the way, if you have narrowed it down to RegisterWindowMessage, then Spy++ can log the registered messages system-wide, along with the thread and process.

Spy++ (I am using the one from VS2015 Community)

Press Ctrl+M and select All Windows in System.

In the Messages tab, click Clear All and then select Registered.

Then start logging.

You can also save the log (it is plain text in spite of the strange extension).

powershell -c "gc spy++.sxl -Tail 3"

<000152> 001F01A4 P message:0xC1B2 [Registered:"nsAppShell:EventID"] wParam:00000000 lParam:06EDFCE0 time:4:27:49.584 point:(408, 221)
<000153> 001F01A4 P message:0xC1B2 [Registered:"nsAppShell:EventID"] wParam:00000000 lParam:06EDFCE0 time:4:27:49.600 point:(408, 221)
<000154> 001F01A4 P message:0xC1B2 [Registered:"nsAppShell:EventID"] wParam:00000000 lParam:06EDFCE0 time:4:27:49.600 point:(408, 221)
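Since the saved log is plain text, it can be summarized with a short script. The following is a minimal sketch that tallies registered message names, assuming every log line follows the same layout as the three lines shown above; I have not checked other Spy++ versions or the file's encoding. The function name tally_registered is made up for this example.

```python
import re
from collections import Counter

# Matches e.g.: message:0xC1B2 [Registered:"nsAppShell:EventID"]
REGISTERED = re.compile(r'message:0x([0-9A-Fa-f]+) \[Registered:"([^"]*)"\]')

def tally_registered(log_text):
    """Count log entries per (message id, registered name) pair."""
    counts = Counter()
    for msg_id, name in REGISTERED.findall(log_text):
        counts[(int(msg_id, 16), name)] += 1
    return counts

# The three lines from the log tail above, used as sample input.
sample = '''<000152> 001F01A4 P message:0xC1B2 [Registered:"nsAppShell:EventID"] wParam:00000000 lParam:06EDFCE0 time:4:27:49.584 point:(408, 221)
<000153> 001F01A4 P message:0xC1B2 [Registered:"nsAppShell:EventID"] wParam:00000000 lParam:06EDFCE0 time:4:27:49.600 point:(408, 221)
<000154> 001F01A4 P message:0xC1B2 [Registered:"nsAppShell:EventID"] wParam:00000000 lParam:06EDFCE0 time:4:27:49.600 point:(408, 221)
'''

counts = tally_registered(sample)
for (msg_id, name), n in counts.most_common():
    print(f"0x{msg_id:04X} {name}: {n}")
```

For leak hunting, the number of distinct names (len(counts)) is probably more telling than how often each message is sent: a steadily growing set of similar-looking names points at the code that keeps registering new ones.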