Troubleshooting XmlSerializer failure

Problem
MyApp application was experiencing identical issue in two independent production environments. Following 4-5 days of flawless system functioning, the application would become nonoperational. The malfunction manifested as a web application crash and a webservice error due to XmlSerializer failure. Once the XmlSerializer would start malfunctioning, the system would not recover. Restarting IIS or the app services would not rectify the issue.
The only way to recover the system was the machine restart. This was a short term kludge used for a week without experiencing any application problems. It was not acceptable as a long term solution. I was determined to address the root cause of the problem.

Troubleshooting
I started troubleshooting by examining the Windows Even Log first. The following entry pointed to XmlSerializer:

The description for Event ID 3 from source ProdMyApp cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.
If the event originated on another computer, the display information had to be saved with the event.
The following information was included with the event:
l:MCA2000DotNetADKTGH.MyCompany.WebFormsBaseDataSource.cs[.ctor]
16:56:38.749 23-03-2011 UTC0x8bae0003
Failed to deserialize the UIEInputDefinitions. Input path: d:Program Files (x86)MyCompanyMyAppSystemsProdMyAppWebPagesApplication Centrexml Caught:System.IO.FileNotFoundException ‘Could not find file ‘C:WindowsTEMPjhnifqi7.dll’.’ raised in Void WinIOError(Int32, System.String) at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath) at System.IO.FileStream.Init(String path, FileMode mode, FileAccess access, Int32 rights, Boolean useRights, FileShare share, Int32 bufferSize, FileOptions options, SECURITY_ATTRIBUTES secAttrs, String msgPath, Boolean bFromProxy) at System.IO.FileStream..ctor(String path, FileMode mode, FileAccess access, FileShare share) at Microsoft.CSharp.CSharpCodeGenerator.FromFileBatch(CompilerParameters options, String[] fileNames) at Microsoft.CSharp.CSharpCodeGenerator.FromSourceBatch(CompilerParameters options, String[] sources) at Microsoft.CSharp.CSharpCodeGenerator.System.CodeDom.Compiler.ICodeCompiler.CompileAssemblyFromSourceBatch(CompilerParameters options, String[] sources) at System.CodeDom.Compiler.CodeDomProvider.CompileAssemblyFromSource(CompilerParameters options, String[] sources) at System.Xml.Serialization.Compiler.Compile(Assembly parent, String ns, XmlSerializerCompilerParameters xmlParameters, Evidence evidence) at System.Xml.Serialization.TempAssembly.GenerateAssembly(XmlMapping[] xmlMappings, Type[] types, String defaultNamespace, Evidence evidence, XmlSerializerCompilerParameters parameters, Assembly assembly, Hashtable assemblies) at System.Xml.Serialization.TempAssembly..ctor(XmlMapping[] xmlMappings, Type[] types, String defaultNamespace, String location, Evidence evidence) at System.Xml.Serialization.XmlSerializer.GenerateTempAssembly(XmlMapping xmlMapping, Type type, String defaultNamespace) at System.Xml.Serialization.XmlSerializer..ctor(Type type, String defaultNamespace) at System.Xml.Serialization.XmlSerializer..ctor(Type type) at TGH.MyCompany.CustomControlDataSource..ctor(String uieInputsDefinitionPath) in l:MCA2000DotNetADKTGH.MyCompany.WebFormsBaseDataSource.cs:line 83
d:Program Files (x86)MyCompanyMyAppSystemsProdMyAppWebPagesApplication Centrexml

My initial belief was that the dll was missing because it was not generated due to lack of disk space or appropriate permissions. Checking the disk space ruled the first possibility out. The permission issue was also out of question since the application worked fine for days and it only stopped working after certain period of time. In order to prove my theory, I modified the web.config file. Instead of writing the serialization temporary files to the c:\windowstemp folder, it was writing it them to d:\temp. Here is the code snippet added to web.config:

<system.diagnostics>
  <switches>
  <add name="XmlSerialization.Compilation" value="4"/>
  </switches>
</system.diagnostics>
<system.xml.serialization>
  <xmlSerializer tempFilesLocation="d:\\Temp"/>
</system.xml.serialization>

The code snippet also has another parameter which prevents system from deleting temporary files. Monitoring the Temp folder I could see that temporary files getting generated. In particular, files having extension *.err, *.out, *.tmp, and *.cs. However, the missing dll reported in the event log never did get generated. I have examined the application, system, and security event logs. No useful info was to be found pointing to a reason for failure to generate the missing dll. I turned to Process Explorer from Sysinternals to look at running processes. There were several instances of iexplore.exe running even though there were no IE browsers windows open. This was suspicious.

I terminated all the processes and tried running MyApp. No crash at this time. The system was functioning correctly. The battle has been won, but I had a long way to go before declaring the war over. I have identified the culprit, but determining what was initiating the creation of all the iexplore.exe processes would turn out to be a larger challenge. Before terminating the processes I have noticed that the root iexplore.exe process did not have a parent and it was running under NetworkService credentials. The w3wp.exe (IIS worker process) process was configured to execute under NetworkService credentials and recycle every 20 min or when idle. It led me to believe that it was responsible for creation of iexplore.exe processes and eventually getting terminated, leaving orphan iexplore.exe processes running. I continued monitoring the system which allowed me to prove the theory of w3wp.exe spawning iexplore.exe processes at irregular intervals.

There was nothing suspicious in the even logs. The system was behaving correctly. Could it be that antivirus software or some other scheduled tasks were responsible for creating processes? The fact that processes were spawned at irregular intervals ruled out that possibility. In the meantime, as a temporary workaround, I implemented a script that would kill all iexplore.exe processes every 30 min. This has “bought” me some time. I started believing that the issue was in one of Microsoft’s applications or .NET framework. Therefore, I asked Microsoft to help troubleshoot. The support engineer asked me to install DebugDiag and configure it to generate a process dump when w3wp.exe process would spawn another process. Before I could do that, I had to install symbols on the server since they couldn’t be downloaded off the internet due to lack of the internet connectivity on the server. Once the symbols were installed and DebugDiag configured, it was time to be patient and wait for iexplore.exe process to get created. I started up MyApp and tried to mimic a typical user’s interactions hoping to create iexplore.exe process. Several minutes later, I stumbled upon an operation that launched IE browser window opening a website. I looked at process explorer and noticed two instances of iexplore.exe running. The first one was the child of w3wp.exe and the second one was the root process with no children.

I closed the IE window which terminated the root iexplore.exe. The other instance continued running even though there was no IE window present. I repeated the entire procedure again and confirmed that MyApp was responsible for creation of iexplore.exe processes that eventually would lead to MyApp malfunction. Another battle won, and I was starting to smell a victory in the air. It was time to shift focus to the MyApp code. I identified the method in the web form code behind responsible for launching the browser. It was launching the URL by creating a process on the server:

System.Diagnostics.Process.Start(RetrievedURL);

The next line was also launching the URL, but this time it was on the client side using the script:

StringBuilder script = new StringBuilder();
script.Append("<script language=\"JavaScript\">");
script.Append("window.open(\"" + RetrievedURL + "\");");
script.Append("</script>");
ClientScript.RegisterStartupScript(this.GetType(), "LaunchURL", Regex.Unescape(script.ToString()));

It was obvious that launching a process from code behind would create the process on the server under w3wp.exe. The javascript code would create the process on the client. I asked myself why would anyone create a process on the server and then create another one on the client. There may be a plausible explanation.

Server/client side misunderstanding
The misunderstanding of server/client code execution produced the bug that was present in MyApp for almost two years. In my opinion, the developer was running Windows XP whose IIS process (asp_net.exe) as well as all other user processes execute in session 0 which is an interactive session. On the development machine creating a process by calling

System.Diagnostics.Process.Start(RetrievedURL);

would launch the browser window. It would appear that the code was functioning correctly. However, that would not be the case on the server. The server IIS process, w3wp.exe, executes in session 0, whereas user processes execute in subsequent sessions starting from session 1. Therefore the Process.Start method would create the process in session 0 which is not an interactive session. No UI window would be displayed. In order to compensate for that, the client side code was added which launches the browser in user session. The result was operation which was creating one process on the server and one process on the client.

Conclusion
The road to victory over the issue was interesting and time consuming. Even though the application worked as expected, at least for a while, the code contained a ticking “time bomb” which was hidden for 2 years.
A functioning code is not necessarily the correct code. Understanding the technology and adhering to proactive SDL practices such as code reviews and static code analysis produces correct code.