5
votes

We have an Azure setup with an Azure Event Grid topic, and an Azure Function App with about 15 functions that subscribe to the topic via different prefix filters. The Function App is set up as a consumption-based resource and should be able to scale as it prefers.

Each subscription is set up to retry delivery up to 10 times over a maximum of 4 hours before dropping the event. So far so good, and the setup is working as expected – most of the time.
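For reference, a subscription with this kind of prefix filter, retry policy, and dead-letter destination can be created with the Azure CLI roughly like below. Resource names, the subscription name, and the dead-letter container are placeholders, not our actual values:

# 10 retries over at most 4 hours (240 minutes); undeliverable events go to blob storage
az eventgrid event-subscription create \
  --name mitbal-added \
  --source-resource-id "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.EventGrid/topics/<topic>" \
  --endpoint-type azurefunction \
  --endpoint "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Web/sites/<function-app>/functions/<function-name>" \
  --subject-begins-with "odl/type/mitbal" \
  --max-delivery-attempts 10 \
  --event-ttl 240 \
  --deadletter-endpoint "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<storage>/blobServices/default/containers/deadletters"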

In certain situations, unknown to us, the Event Grid topic seems unable to deliver events to the different functions. What we can see is that our dead-letter storage fills up with events that have not been delivered.

Now to my question

From the logs we can see the reason why various events were not delivered. The reason is most often Outcome: Probation. We cannot find any information from Microsoft on what this actually means.

In addition, the Grid gives up and adds the event to the dead-letter log before either the timeout policy (4 hours) or the delivery attempts policy (10 retries) has been exhausted. Sometimes the Function App is idling and does not receive any events from the Grid.

Do any of you good people have ideas on how we can proceed with troubleshooting this? What has happened between the Grid and the Function App when the Probation outcome occurs? One thing we have noticed is that the number of connections from the Grid to our Function App is quite high compared to the number of events delivered. There are no other incoming connections to the Function App besides the Event Grid.

Example of a dead letter message

[{
   "id":"a40a1f02-5ec8-46c3-a349-aea6aaff646f",
   "eventTime":"2020-06-02T17:45:09.9710145Z",
   "eventType":"mitbalAdded",
   "dataVersion":"1",
   "metadataVersion":"1",
   "topic":"/subscriptions/XXXXXXX/resourceGroups/XXXX_STAGING/providers/Microsoft.EventGrid/topics/XXXXXstaging",
   "subject":"odl/type/mitbal/v1",
   "deadLetterReason":"TimeToLiveExceeded",
   "deliveryAttempts":6,
   "lastDeliveryOutcome":"Probation",
   "publishTime":"2020-06-02T17:45:10.1869491Z",
   "lastDeliveryAttemptTime":"2020-06-02T19:30:10.5756332Z",
   "data":"<?xml version=\"1.0\" encoding=\"utf-8\"?><Stock><Action>ADD</Action><Id>123456</Id><Store>123</Store><Shelf>1</Shelf></Stock>"
}]


Function Service Metrics

  • Blue = Connections (count)
  • Red = Function Executions (count)
  • White = Requests (count)


1
As for the early dead-lettering of messages before the max delivery count or time-to-live is reached, it would help to have a deeper look into what's going on. Could you send an email to azcommunity[at]microsoft[dot]com with a link to this thread? – PramodValavala-MSFT

1 Answer

0
votes

I'm not sure if you have figured out the issue here, but here are some insights for others in a comparable situation.

Firstly, Probation is the delivery outcome reported when the destination is not healthy; Event Grid will still attempt deliveries to an endpoint in that state.

Based on the graph, it looks like the functions hit the 100-executions mark and then took a while to scale out for the next 100. You could get better results by tweaking the host.json settings, depending on what each function execution does.
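For example, if the functions receive the events over HTTP (webhook delivery), the HTTP concurrency throttles in host.json are the usual knobs. This is only a sketch; the values below are illustrative, not a recommendation:

{
  "version": "2.0",
  "extensions": {
    "http": {
      "maxConcurrentRequests": 100,
      "maxOutstandingRequests": 200,
      "dynamicThrottlesEnabled": true
    }
  }
}

When the outstanding-request limit is hit, the runtime rejects further requests with 429, which Event Grid treats as a retryable failure, so tuning these limits trades per-instance load against retries while the app scales out.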

Including the scale controller logs could shed more light on what is happening internally when scaling out.
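If it helps, scale controller logging can be enabled with an app setting; the app and resource group names below are placeholders:

# emit scale controller decisions to Application Insights
az functionapp config appsettings set \
  --name <function-app> \
  --resource-group <rg> \
  --settings SCALE_CONTROLLER_LOGGING_ENABLED=AppInsights:Verbose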

Also, another option would be to send the events into Service Bus or Event Hubs first and then have a function run from there.
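A minimal sketch of that routing, assuming a Service Bus queue as the intermediate hop (resource names are placeholders): the Event Grid subscription delivers matching events to the queue, and a queue-triggered function then drains it at its own pace.

# deliver matching events to a Service Bus queue instead of calling the function directly
az eventgrid event-subscription create \
  --name mitbal-to-servicebus \
  --source-resource-id "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.EventGrid/topics/<topic>" \
  --endpoint-type servicebusqueue \
  --endpoint "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.ServiceBus/namespaces/<namespace>/queues/<queue>" \
  --subject-begins-with "odl/type/mitbal"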