How can an Azure Function (v1) with a Cosmos DB Trigger recover from a Cosmos DB outage? Should this happen automatically, or is a Function App restart required?

In our scenario, Cosmos DB was unavailable because the subscription spending limit had been reached. After we removed the spending limit, Cosmos DB was available again, and Functions writing to Cosmos DB through Output Bindings succeeded.

However, the function that was connected to Cosmos DB through a Cosmos DB Trigger didn't recover from this outage and kept throwing the following exception:

Microsoft.Azure.WebJobs.Host.Listeners.FunctionListenerException: The listener for function 'xxx' was unable to start. ---> System.NullReferenceException: Object reference not set to an instance of an object.
   at Microsoft.Azure.Documents.ChangeFeedProcessor.ChangeFeedEventHost.<StartAsync>d__77.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.Documents.ChangeFeedProcessor.ChangeFeedEventHost.<RegisterObserverFactoryAsync>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.WebJobs.Extensions.DocumentDB.CosmosDBTriggerListener.<StartAsync>d__8.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.WebJobs.Host.Listeners.FunctionListener.<StartAsync>d__14.MoveNext()
   --- End of inner exception stack trace ---

After restarting the Function App, the Cosmos DB Trigger was working again.

I suppose this situation could also occur when Cosmos DB suffers an outage for other reasons.

To make our system resilient against temporary Cosmos DB outages, how can we get Cosmos DB Triggers back to an operational state? Do we have to restart the Function App when errors occur, or is there a better way?

1 Answer

This is caused by an issue in the retry mechanism between the Azure Functions runtime and the Cosmos DB Trigger.

The runtime keeps trying to reinitialize the Trigger with exponential back-off, but it ran into a problem in this particular scenario (an account over its spending limit). The problem was tracked in a GitHub issue.
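
To illustrate what "retrying with exponential back-off" means in general, here is a minimal Python sketch. It is not the Azure Functions runtime's code; `start_listener`, `backoff_delays`, `start_with_retry`, and the delay values are hypothetical names and numbers chosen only for this example.

```python
import time
from typing import Callable, Iterator


def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   cap: float = 120.0) -> Iterator[float]:
    """Yield wait times that grow by `factor` each step, never exceeding `cap`."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def start_with_retry(start_listener: Callable[[], None],
                     max_attempts: int = 5,
                     sleep: Callable[[float], None] = time.sleep) -> int:
    """Call the (hypothetical) start_listener until it succeeds.

    Returns the number of attempts made. Re-raises the last error once
    max_attempts is exhausted.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delays = backoff_delays()
    for attempt in range(1, max_attempts + 1):
        try:
            start_listener()
            return attempt
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the last failure
            sleep(next(delays))  # wait longer before each new attempt
    raise AssertionError("unreachable")
```

The point of such a loop is that a listener which fails while the dependency is down eventually starts on its own once the dependency is back. The problem described in this answer is that, in the spending-limit scenario, the reinitialization attempts themselves kept failing even after the account was available again.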

A fix for this has been merged and should be included in an upcoming release of the runtime.

Once the fix is in place, you should not need to restart the Function App after the account becomes available again.