
In my current project, our team uses WCF services hosted on IIS.

Here are some technical details which may be important:

  1. We use .NET 3.5 for WCF services
  2. We use the Net.Tcp communication protocol
  3. We use both IIS 7 and IIS 7.5 to host these services
  4. We use multiple IIS worker processes on each server

So, the problem is that the WCF services sometimes become unavailable. When we try to reach them, we get a timeout error. The only way to restore them is to restart the NetTcpActivator (Net.Tcp Listener Adapter) Windows service.

According to my colleague's theory, this error may be related to the problems described in this KB article:

FIX: Smsvchost.exe for the WCF service stops responding when you run a .NET Framework 4-based WCF service http://support.microsoft.com/kb/2536618

According to this article, SMSvcHost (the container process that hosts NetTcpActivator and the Port Sharing Service) hangs if it can't route a request to w3wp (the IIS worker process) within 60 seconds (a non-configurable timeout). Unfortunately, we have been unable to find a way to reproduce this error. For example, we limited SMSvcHost to 1 CPU core and 1 thread, raised the pending connections limit to 1M, and pushed it to 100% CPU load in user mode. And it didn't hang!
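
For reference, the pending connections limit for the Net.Tcp listener is normally set in SMSvcHost.exe.config, which sits next to SMSvcHost.exe in the .NET Framework directory. The fragment below is only a sketch, assuming that this is the setting involved: the 1000000 value mirrors the "1M" mentioned above, and the other attribute values are illustrative and should be checked against your own file.

```xml
<!-- SMSvcHost.exe.config (fragment); values are illustrative assumptions -->
<configuration>
  <system.serviceModel.activation>
    <!-- maxPendingConnections caps connections waiting to be dispatched to a worker process -->
    <net.tcp listenBacklog="10"
             maxPendingConnections="1000000"
             maxPendingAccepts="2"
             receiveTimeout="00:00:10"
             teredoEnabled="false" />
  </system.serviceModel.activation>
</configuration>
```

Changes to this file should only take effect after the SMSvcHost services are fully stopped and started again.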

Sometimes our load tests lead to strange errors, but when we stop them, all services automatically recover to their normal state. Yet sometimes even a light load can hang NetTcpActivator!

In addition, this is not a new problem. My colleagues already ran into it 3 years ago (see this thread for additional information: http://forums.iis.net/t/1167668.aspx/1/10). Unfortunately, they never got an answer. The problem just disappeared after some configuration changes! And now it has come back on the new server.

I would really appreciate all your thoughts and ideas!

I have a ticket open with Microsoft regarding this. I'm able to reproduce frequently, though not reliably. So far, it appears to not be the same issue that you linked to since a fix for that is already out and memory dumps were different. Hopefully we'll be able to get a resolution to this and I'll post the update here. - Nelson Rothermel

1 Answer


Alright, after lots of research I tracked down the cause of our issue. There may be other scenarios where this occurs, but hopefully this will help some people. Microsoft is in the process of reproducing it in their labs and should have a fix eventually.

In our case, all the planets had to align. We had one .NET 4 integrated app pool for both client and server (on a developer machine). The service was using an external config file for bindings (<bindings configSource="serviceModel.bindings.config" />), which was linked from another project and copied at build time by a custom build task added to the service's .csproj.
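
For anyone unfamiliar with configSource, the setup described above looks roughly like the sketch below. Only the file name serviceModel.bindings.config comes from our project; the binding name ExampleTcpBinding is a hypothetical placeholder, not our real configuration.

```xml
<!-- Web.config of the service (fragment) -->
<system.serviceModel>
  <!-- The whole bindings section is loaded from a separate file -->
  <bindings configSource="serviceModel.bindings.config" />
</system.serviceModel>
```

```xml
<!-- serviceModel.bindings.config: the root element is the section itself -->
<bindings>
  <netTcpBinding>
    <!-- Hypothetical binding name, for illustration only -->
    <binding name="ExampleTcpBinding" />
  </netTcpBinding>
</bindings>
```

An element that uses configSource cannot also carry inline content, so the bindings live entirely in the external file.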

To reproduce the issue:

  1. Stop all SMSvcHost services that are running (Net.Tcp*, Net.Pipe, Net.Msmq). Restart won't work since the SMSvcHost process doesn't go away.
  2. From Visual Studio, run a Clean for WcfService
  3. From Windows Explorer, delete serviceModel.bindings.config in WcfService
  4. Run iisreset (gets rid of w3wp and starts SMSvcHost services -- press F5 in the services list to see that)
  5. Build WcfService (copies the linked config file)
  6. Browse to the WcfClient page and submit twice. If you get an error each time, you probably have the issue. In our main application it gave a timeout; in the test app it gave a CommunicationObjectFaultedException instead, but either is fine.
  7. Stop the SMSvcHost services. If the issue occurred, Event ID 8 for SMSvcHost is logged to the System event log.

I don't know yet if w3wp or SMSvcHost is the culprit. Step #3 is critical, though I can't explain why yet. If you don't delete the file, then all is fine. If you modify the file (created date stays the same), all is fine. If you move the config XML into the main Web.config file, all is fine. When the build task copies the file the created date is updated, so I am guessing it's cached some way and one of the processes detects the date change.

If you restart the SMSvcHost services (full stop, full start) once or twice, the client request will go through and from then on you're fine.

So my guess for now is that this could be an issue right after a deployment, but if you make sure everything is running (and restart services as needed) then you should be fine. You can also avoid using external/linked config files.

Once Microsoft tracks down the issue I will hopefully have more insight.

Final Update: I forgot to come back to this earlier. Microsoft essentially admitted they probably had a bug, but since there was a workaround and they had spent enough time on the ticket, they were closing it and not researching further. There appears to be some type of race condition when SMSvcHost starts up with the following setup (similar to what I posted earlier):

  1. Host WCF in IIS
  2. Use a non-HTTP binding so that SMSvcHost comes into play
  3. Use external config file for bindings using configSource

Linking the external config had nothing to do with it. The workaround was to not use configSource which we are doing now.
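
As a rough sketch of the workaround, the bindings go back inline in Web.config instead of being pulled in through configSource. The binding name ExampleTcpBinding is again a hypothetical placeholder; use whatever your external file contained.

```xml
<!-- Web.config of the service (fragment), with the bindings inline -->
<system.serviceModel>
  <bindings>
    <netTcpBinding>
      <!-- Same content that previously lived in the external file -->
      <binding name="ExampleTcpBinding" />
    </netTcpBinding>
  </bindings>
</system.serviceModel>
```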