Unable to fail Azure DevOps Release Pipeline when Azure WebJob fails to start/run

Question

What I'm looking for:

How can we integrate into our Release Pipeline an automated solution for knowing if a new WebJob deploy entered the running status within 'X' period of time?

More Details:

We are using Azure DevOps Release Pipelines with the AzureRmWebAppDeployment@4 task. We are able to deploy our Azure WebJob to staging and to production environments.

Recently we discovered that our WebJob had not actually started because of some bad code. Due to the nature of the WebJob it was not something we could easily identify in staging. We deployed the bad code to production and days later, due to bad alerting, learned the WebJob was not running and our queue was severely backed up.

The issue is that we need our release pipeline to report a failure whenever a WebJob fails to start. Our APIs use health checks to verify that a deployment started up and is healthy. We need an equivalent for WebJobs: inspect the WebJob's status during the release so that the pipeline fails instead of letting us think everything is working when it is not.

In our research we've found we could potentially use Kudu, but so far we have not found out how to get that working as part of the release pipeline.

dannydwarren · Accepted Answer · 2020-10-22T17:25:34

After combining ideas from multiple sources we came up with this solution:

In the desired stage of your Azure Release Pipeline, add an Azure CLI task. This task can accept an inline PowerShell script or a path to a PowerShell script. Choose your own adventure. We chose to create a CheckWebJobStatus.ps1 with the included script (below) and exposed it as an artifact available to our Azure Release Pipeline.
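Classic release pipelines are configured in the UI, but the same Azure CLI task can be written out in YAML form to show which settings matter. The fragment below is only a sketch: the service connection name, the artifact path, and the argument values are placeholders, not values from our setup.

```yaml
- task: AzureCLI@2
  displayName: Check WebJob status
  inputs:
    azureSubscription: 'my-service-connection'   # hypothetical service connection name
    scriptType: ps                                # Windows PowerShell; use pscore for PowerShell Core
    scriptLocation: scriptPath                    # run a script file rather than an inline script
    scriptPath: '$(System.DefaultWorkingDirectory)/_DeployTools/drop/CheckWebJobStatus.ps1'  # hypothetical artifact alias and folder
    arguments: '-resourceGroup my-rg -appServiceName my-app -jobName my-webjob -totalRuns 10'  # placeholder values
```

Because the script exits with a non-zero code when the WebJob never stabilizes, the task fails and so does the stage.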

What this PowerShell script does in short:
It checks the target WebJob's status up to 10 times (configurable via $totalRuns), waiting 5 seconds between checks, and succeeds as soon as it sees 3 consecutive Running status reports. If that never happens, it exits with code 1, which fails the task.

param(
    $resourceGroup,
    $appServiceName,
    $jobName,
    $totalRuns = 10
)

Write-Host "Checking status of $jobName in $resourceGroup/$appServiceName"

$consecutiveRunningStatuses = 0

# Success requires 3 consecutive Running reports, so fewer than 3 checks can never pass.
if ($totalRuns -lt 3) {
    Write-Error "totalRuns must be 3 or greater"
    exit 1
}

for ($i = 0; $i -lt $totalRuns; $i++) {
    # List all continuous WebJobs on the App Service and parse the JSON output.
    $jobs = (az webapp webjob continuous list --name $appServiceName --resource-group $resourceGroup | ConvertFrom-Json)

    foreach ($job in $jobs) {
        # The CLI reports job names as "<appServiceName>/<jobName>".
        if ($job.name -eq "$appServiceName/$jobName") {
            if ($job.status -eq "Running") {
                Write-Host "$jobName is running! Attempt $i"
                $consecutiveRunningStatuses++

                if ($consecutiveRunningStatuses -eq 3) {
                    Write-Host "$jobName is running $consecutiveRunningStatuses times in a row! We assume that means it is stable."
                    exit 0
                }
            }
            else {
                # Any non-Running report breaks the streak.
                Write-Host "$jobName status is $($job.status). Attempt $i"
                $consecutiveRunningStatuses = 0
            }
        }
    }

    # Skip the wait after the final check.
    if ($i -ne ($totalRuns - 1)) {
        Start-Sleep 5
    }
}

# A non-zero exit code fails the Azure CLI task and therefore the release stage.
Write-Host "$jobName failed to start after $totalRuns checks"
exit 1

Why 3 consecutive Running status reports?
Because Azure WebJobs status reporting is not reliable. When a WebJob first deploys it enters the Starting status and then the Running status. So far that seems good. However, if there is a fatal error on startup, such as a missing dependency, the job then enters the Pending Restart status. In our observation, Azure either automatically tries to start the WebJob again or the status is reported erroneously as Running. The WebJob then re-enters the Pending Restart status and stays there until the next explicit attempt to deploy or start it. We never saw a failing WebJob remain in the Running status for more than 2 consecutive reports 5 seconds apart or, in other words, within any 15 second window. Therefore the script assumes, for now, that 3 consecutive Running status reports within 15 seconds mean the WebJob is actually running.
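To make the streak rule concrete, here is a minimal sketch that applies the same "3 in a row" logic to a fixed list of status reports instead of live az output. The function name Test-StableRunning and the sample status sequences are hypothetical and exist only for illustration. Only the string Running matters to the logic.

```powershell
# Hypothetical helper: returns $true once $Required consecutive "Running" reports are seen.
function Test-StableRunning {
    param(
        [string[]] $Statuses,
        [int] $Required = 3
    )

    $streak = 0
    foreach ($status in $Statuses) {
        if ($status -eq 'Running') {
            $streak++
            if ($streak -ge $Required) { return $true }
        }
        else {
            # Any other status resets the streak, as in the script above.
            $streak = 0
        }
    }
    return $false
}

# A failing job that flickers through Running twice never reaches a streak of 3.
Test-StableRunning -Statuses 'Starting', 'Running', 'Running', 'PendingRestart', 'Running', 'PendingRestart'   # False

# A healthy job reports Running three times in a row.
Test-StableRunning -Statuses 'Starting', 'Running', 'Running', 'Running'   # True
```

If your WebJobs fail more slowly than ours did, raising the required streak or the sleep interval is the knob to turn.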

Aside - How we did it:
We created a dedicated DeployTools repo with its own azure-pipelines.yaml build configuration which only publishes the folder with that PowerShell file. Then in our desired Azure Release Pipeline we attached the artifacts from the DeployTools build.
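For reference, a build configuration that only publishes a scripts folder can be very small. This is a sketch rather than our actual file: the branch name, agent image, folder name, and artifact name are all assumptions.

```yaml
# azure-pipelines.yaml in the DeployTools repo (sketch)
trigger:
  - main                      # assumed default branch

pool:
  vmImage: 'windows-latest'   # assumed agent image

steps:
  - task: PublishBuildArtifacts@1
    inputs:
      PathtoPublish: '$(Build.SourcesDirectory)/scripts'   # hypothetical folder containing CheckWebJobStatus.ps1
      ArtifactName: 'drop'                                 # hypothetical artifact name
```

The release pipeline can then add this build as an artifact source and point the Azure CLI task at the published script.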
