6
votes

HTTP Error 500.37 - ANCM Failed to Start Within Startup Time Limit

We are seeing this error on our Azure App Services running .NET Core 3.1. It looks like when Azure updates the server farm, our instances get restarted and Azure tries to restart all app services at the same time. We run a lot of services on one instance because it is a DEV/QA instance. The instance has enough resources for normal operation, but it looks like startup takes longer when everything is restarted at once.

The problem is that the app service doesn't recover from this, so our services only start working again when we restart the app manually.

Here they mention the error: https://docs.microsoft.com/en-us/aspnet/core/test/troubleshoot-azure-iis?view=aspnetcore-3.1#:~:text=500.37%20ANCM%20Failed%20to%20Start%20Within%20Startup%20Time%20Limit&text=By%20default%2C%20the%20timeout%20is,startup%20process%20of%20multiple%20apps.

The guidance there is to "stagger the startup process of multiple apps", but on an update of the server farm I don't think we have that ability, correct? That seems to be confirmed here: https://twitter.com/martincetkovsky/status/1231160330488774657?lang=en

Based on this: https://docs.microsoft.com/en-us/aspnet/core/host-and-deploy/aspnet-core-module?view=aspnetcore-3.1#attributes-of-the-aspnetcore-element

startupTimeLimit
Duration in seconds that the module waits for the executable to start a process listening on the port. If this time limit is exceeded, the module kills the process. The module attempts to relaunch the process when it receives a new request and continues to attempt to restart the process on subsequent incoming requests unless the app fails to start rapidFailsPerMinute number of times in the last rolling minute.

This implies the app would retry, at least after 1 minute, but that doesn't seem to be the case for us. Could this be a misconfiguration on our end?

I would be OK with getting some of these errors after an update (it is DEV/QA after all), but it is a problem if the app doesn't recover. In prod we shouldn't see this because we have more resources available, but automatic recovery is important there too.

How can I make sure our services don't get stuck in this state, other than by heavily oversizing our server farms (with the associated cost)?

1
Same problem, but I only have one small instance that has been running rock solid for over a year. Overnight, a call came from a client to say that it's not working, and this is the error. Very frustrating. - DJA

1 Answer

6
votes

Based on a recommendation from Microsoft, I went ahead and set up AutoHeal on our web apps.

This is the ARM template excerpt I am using:

    "autoHealEnabled": true,
    "autoHealRules": {
      "triggers": {
        "privateBytesInKB": 0,
        "statusCodes": [
          {
            "status": 500,
            "subStatus": 37, //Startup time limit 120000 in DEV and QA
            "win32Status": 0,
            "count": 1,
            "timeInterval": "00:01:00"
          }
        ]
      },
      "actions": {
        "actionType": "Recycle",
        "minProcessExecutionTime": "00:00:00"
      }
    }
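
For context, here is a rough sketch of where an excerpt like this might sit in a full template, assuming the settings are applied through a Microsoft.Web/sites/config resource named "web". The siteName parameter and the apiVersion value are assumptions for illustration and are not taken from the template above; adjust them to match your own deployment.

    {
      // Hypothetical wrapper resource: "siteName" is an assumed template parameter
      "type": "Microsoft.Web/sites/config",
      "apiVersion": "2020-06-01", // assumed; use the version your template targets
      "name": "[concat(parameters('siteName'), '/web')]",
      "properties": {
        "autoHealEnabled": true,
        "autoHealRules": {
          "triggers": {
            "statusCodes": [
              {
                // Trigger on a single HTTP 500.37 within one minute
                "status": 500,
                "subStatus": 37,
                "win32Status": 0,
                "count": 1,
                "timeInterval": "00:01:00"
              }
            ]
          },
          "actions": {
            // Recycle the worker process when the trigger fires
            "actionType": "Recycle",
            "minProcessExecutionTime": "00:00:00"
          }
        }
      }
    }

The same two properties can alternatively be placed under properties.siteConfig of the Microsoft.Web/sites resource itself, if that is how the rest of your template is organized.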

The deployment of this change is still ongoing in our environment, so I haven't fully verified that it solves the issue, but it seems promising.