Diagnosing Stack Overflow Faults in ASP.NET Production Environments

WinDbg is a tool that is immensely useful, but painfully hard to get any value from if you don’t know how to use it. On a few occasions I’ve found myself in a situation where I knew that WinDbg could give me the answer, but I didn’t have enough knowledge of how to use it.

This week we performed a production deployment on one of the projects I am involved in. Pretty soon after, we started to notice some issues cropping up that were affecting the performance and stability of the whole site.

In this scenario, WinDbg was a life saver for us.

The first sign that something was wrong was this event getting fired across the whole front-end tier:

03 Jul 2009 08:32:43 AM
Computer: WEBTIER14
Monitor Title: "Event Log Monitor" (Type=Event Log Monitor)
Description:
* Event Time: 03 Jul 2009 08:32:42 AM
* Source: .NET Runtime 2.0 Error Reporting
* Event Log: Application
* Type: Error Event
* Event ID: 1000
* Faulting application w3wp.exe, version 6.0.3790.3959, stamp 45d6968e, faulting module kernel32.dll, version 5.2.3790.4062, stamp 4626467c, debug? 0, fault address 0x00022366.

We also noted that CPU usage was spiking erratically and was well above the average we would expect for the number of firewall connections we had open.

This event log entry was telling us that the w3wp worker process was faulting in an unrecoverable way, causing the application pool to be torn down in IIS. The nature of the fault was also causing the Windows Error Reporting dw20.exe process to trigger and try capturing a mini dump. These rapid recycling combined with the CPU and IO intensive task of capturing a mini dump was killing our servers and we needed to act fast.

The first step was to work out what was causing the process to be torn down in the first place, as the impact of Windows Error Reporting’s automated analysis was only a secondary issue. To capture this information, I used the adplus script in the Debugging Tools for Windows package to capture a dump.

In its default configuration, adplus will capture a crash dump on any first chance exception. Being an ASP.NET application there are a number of exceptions that we expect to be thrown as 404s are reported, redirects are performed, etc. As such, running adplus in its default configuration was causing it to grab the crash dump too frequently and not giving us the fault that was causing the complete teardown.

To capture the process teardown fault, I ran adplus with these parameters:

adplus.vbs -pn w3wp.exe -c Unknown.cfg

The first parameter just says that we want to capture dump data for the w3wp.exe process. The second parameter is the interesting one – I am passing in a configuration file that includes some exception filters. I grabbed this config from Tess’ blog.

The content of the configuration file is fairly simple XML that describes the unknown fault scenario:

<adplus>
    <settings>
        <runmode> CRASH </runmode>
    </settings>
    <exceptions>
        <config>
            <code>AllExceptions</code>
            <actions1>Log</actions1>
            <actions2>MiniDump;Log;EventLog</actions2>
        </config>
        <newexception>
            <code> 0xe053534f </code>
            <name> Unknown_Exception </name>
        </newexception>
        <config>
            <code> 0xe053534f </code>
            <actions1>FullDump;Log;EventLog</actions1>
            <actions2>FullDump;Log;EventLog</actions2>
        </config>
    </exceptions>
</adplus>

With the debugger now attached, it was a simple matter of waiting for the problem to occur. In our scenario it was happening every 10 to 15 minutes per server, so it took a while for me to capture the dump. Finally, the console showed “Dump file successfully written”.

To analyse the dump file, I opened windbg.exe, chose Open Crash Dump and selected the .dmp file that had been written out. You’ll usually find it in a CrashDump folder next to where you ran adplus from.

As we are working with a .NET application, but WinDbg is a native debugger, my next step was to load a set of debugger helpers called SOS. This is basically a plugin for WinDbg that lets us access information about the CLR state.

To load it, run:

.loadby sos mscorwks

The period at the start says it’s a WinDbg core command. The next parameter is the name of the library we want to load into the debugger. The final parameter is basically a relative file base – we’re telling WinDbg to load SOS from the same location that mscorwks had been loaded from. mscorwks is the .NET runtime, so this approach helps ensure we are loading the correct version and architecture of SOS and it saves us from having to write out the full path.

Next, I wanted to see what point of our application we were in when the process faulted. With SOS loaded, the command is quite simple:

!clrstack

Instantly, it was obvious that we had a stack overflow problem. This was made clear by literally hundreds of repeated frames on our call stack. It also explained why our ASP.NET Health Monitoring wasn’t reporting an exception, as a stack overflow causes the whole process to be torn down before any exception handling code can be run.

Here’s a short snippet (with the project details masked out):

...
CompanyName.Project.Web.Logic.Component.Method1(System.String, System.Int32)
CompanyName.Project.Web.Logic.Component.Method2(System.String)
CompanyName.Project.Web.Logic.Component.Method1(System.String, System.Int32)
CompanyName.Project.Web.Logic.Component.Method2(System.String)
CompanyName.Project.Web.Logic.Component.Method1(System.String, System.Int32)
CompanyName.Project.Web.Logic.Component.Method2(System.String)
CompanyName.Project.Web.Logic.Component.Method1(System.String, System.Int32)
CompanyName.Project.Web.Logic.Component.Method2(System.String)
CompanyName.Project.Web.Logic.Component.Method1(System.String, System.Int32)
CompanyName.Project.Web.Logic.Component.Method2(System.String)
CompanyName.Project.Web.Logic.Component.Method1(System.String, System.Int32)
CompanyName.Project.Web.Logic.Component.Method2(System.String)
CompanyName.Project.Web.Logic.Component.Method1(System.String, System.Int32)
CompanyName.Project.Web.Logic.Component.Method2(System.String)
...

The component in question is used across almost all of our pages and the site wasn’t completely down, so I could now derive that it was being caused by a very specific piece of data. This leads us to our next challenge, which is to find out what which record was triggering the problem.

To do so, I ran:

!clrstack -p

The argument of -p asks SOS to list out all of the parameters for each stack frame.

Finding out the URL of the current request is a non-trivial task, and not all parameters are available due to runtime optimisations. Rather than persisting in a search for the URL, it was much easier to grab another stack frame which included the information I needed:

1947ed28 1abb1c7e CompanyName.Project.Web.Logic.Presenters.RetailItemPresenter.View_ItemLoading(System.Object, CompanyName.Project.Web.Logic.Views.ItemLoadingEventArgs)
PARAMETERS:
this = 0x04144bf8
sender =
e = 0x04151b38

This is the event handler on one of our presenters being called as a result of the ItemLoading event being fired on the corresponding view. This approach is specific to our MVP architecture, but it was a known and simple entry point into most of the stack. Try and look for an equivalent call in your own stack.

I could now see that it was a particular retail item triggering the problem, and I knew that the information I needed was in the event arguments. You can see the parameter on the stack frame shown as e = 0x04151b38. That number is the memory location of the object we need.

To get the object, I then ran:

!do 0x04151b38

The !do command is a shortcut for !DumpObject, and gave us this output:

Name: CompanyName.Project.Web.Logic.Views.ItemLoadingEventArgs
MethodTable: 19ef2170
EEClass: 19d49b70
Size: 12(0xc) bytes
(c:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\Temporary ASP.NET Files\root\aef04219\e69835b3\assembly\dl3\29e5062003a4007_a4fac901\CompanyName.Project.Web.Logic.DLL)
Fields:
MT Field Offset Type VT Attr Value Name
7932a400 4000183 4c System.EventArgs 0 shared static Empty
>> Domain:Value 001cc630:NotInit 00204700:03043914 <<
793308ec 4000819 4 System.String 0 instance 0412522c <RetailItemId>k__BackingField

We can now see the RetailItemId property and its corresponding backing field. The value column on that line is 0412522c, which is another memory reference.

Once again, we dump the memory reference:

!do 0412522c

This returns us the corresponding System.String object:

Name: System.String
MethodTable: 793308ec
EEClass: 790ed64c
Size: 28(0x1c) bytes
(C:\WINDOWS\assembly\GAC_32\mscorlib\2.0.0.0__b77a5c561934e089\mscorlib.dll)
String: 6T046
Fields:
MT Field Offset Type VT Attr Value Name
79332b38 4000096 4 System.Int32 1 instance 6 m_arrayLength
79332b38 4000097 8 System.Int32 1 instance 5 m_stringLength
793315cc 4000098 c System.Char 1 instance 36 m_firstChar
793308ec 4000099 10 System.String 0 shared static Empty
>> Domain:Value 001cc630:02fe01d0 00204700:02fe01d0 <<
7933151c 400009a 14 System.Char[] 0 shared static WhitespaceChars
>> Domain:Value 001cc630:02fe0728 00204700:06fe0e14 <<

Hidden away on the sixth line there is the raw string value, 6T046, which was the product id.

At this point, we had:

  1. Captured a complete memory dump of the w3wp.exe process when the fault occurred
  2. Identified a stack overflow to be the cause of the fault
  3. Identified the area of our code that was involved in the stack overflow
  4. Identified the particular piece of data that was exposing the bug

We could now:

  1. Perform a quick fix to the data to circumvent the bug and get the website performant and stable again
  2. Reproduce the bug locally using the same data that caused the issue
  3. Have the lazy Friday we’d been hoping for

7 comments

  1. Well done! Great article, debugging memory dumps is an art not a science. Good to see you follow Tess’s blog. How about debugging .net 4.0 (beta 1) dumps Tatham? omg! Happy coding, Jezza

  2. Hi Jezza,

    How would a 4.0 dump be any different? Loading SOS relative to the location of mscorwks ensures that you are loading the correct version architecture. I can’t think of what other differences there would be?

    Note that VS2010 also has very good support for crash dump analysis, which would remove the need for windbg in this scenario. Only adplus would have been required to capture the dump in the first place.

    – Tatham

  3. We are trying to debug a StackOverflowException in a ASP.NET applicaton (.NET 4.0) by using the method described above. Unfortunately, all managed threads show with a thread id of XXXX, and we are not able to switch to any of them to run !clrstack.

    We would appreciate any suggestions.

    Here is an example of the !threads output:

    PreEmptive Lock

    ID OSID ThreadOBJ State GC GC Alloc Context Domain Count APT Exception

    XXXX 1 3b38 0000000000169520 1008220 Enabled 0000000000000000:0000000000000000 0000000000152230 0 Ukn (Threadpool Worker)

    XXXX 2 21960 000000000014baf0 b220 Enabled 0000000163d3c530:0000000163d3e470 0000000000152230 0 Ukn (Finalizer)

    XXXX 3 3a98 000000000628eb20 100a220 Enabled 0000000000000000:0000000000000000 0000000000152230 0 Ukn (Threadpool Worker)

    XXXX 4 1aba4 00000000062b4ad0 1220 Enabled 0000000000000000:0000000000000000 0000000000152230 0 Ukn

    …..

    XXXX 57 1bfe0 000000000ba0f550 1009220 Enabled 00000000c3ff8160:00000000c3ff8268 00000000062a0e10 1 Ukn (Threadpool Worker) System.StackOverflowException (00000000ffff01f0)

    XXXX 5f 19fa0 000000000b862c20 1009220 Enabled 0000000123c047e0:0000000123c06780 0000000000152230 0 Ukn (Threadpool Worker)

    XXXX 32 9d3c 000000000b864c60 1009220 Enabled 00000000c3419b30:00000000c341a268 00000000062a0e10 1 Ukn (Threadpool Worker)

    XXXX 67 11a8c 000000000bae3d00 1009220 Enabled 0000000000000000:0000000000000000 0000000000152230 0 Ukn (Threadpool Worker)

Comments are closed.