
I did not find much in the way of troubleshooting a lost-events scenario in Azure Event Grid.

Hence I am asking a question about the following scenario:

  1. Our code publishes the events to the domain.
  2. The events are delivered to the configured web hook in the subscription.
  3. This works for a while.
  4. The consumer (who owns the web hook endpoint) complains that some events are not arriving, although most come through.
  5. We look in the configured dead-letter queue and find that there are no events. It has been more than a day and hence all retries are already exhausted.
  6. Hence we assume that all events were delivered, because the metrics show no failed deliveries.
  7. We also make sure that we indeed submitted these mysterious events to the grid.
  8. But the consumer insists there is a problem and demonstrates that nothing is wrong on his side.
  9. Now we need to figure out if some of these events are being swallowed by the event grid.

How do I go about troubleshooting this scenario?

Two things come to mind; mere suggestions, as it's tough to answer this one exactly. 1) Event delivery metrics: see the counts, especially Delivery Succeeded, Delivery Failed, etc. Detailed steps here - docs.microsoft.com/en-us/azure/event-grid/… 2) Make sure there isn't something wrong with your dead-letter configuration. You can try to simulate a failed delivery by sending to one of your own endpoints that always returns an error (e.g. 500, or any 400 series), then see if at least these intentionally failed events show up in the configured dead-letter location. – Rohit Saigal
Also, to make your life easier while simulating failed events, you could set the retry policy to try only once or twice (instead of the default 30) and lower the event TTL. Detailed commands here - docs.microsoft.com/en-us/azure/event-grid/manage-event-delivery – Rohit Saigal
See my response to Roman. – Raghu
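
To make that simulation concrete, here is a minimal sketch of such an always-failing endpoint, assuming Flask (route and port are arbitrary). Note that the subscription validation handshake still has to succeed, or the subscription cannot be created at all:

    # A minimal sketch, assuming Flask, of a webhook that completes the Event
    # Grid validation handshake but deliberately fails every real delivery,
    # so events end up in the dead-letter container once retries expire.
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    @app.route("/api/failing-hook", methods=["POST"])  # hypothetical route
    def failing_hook():
        events = request.get_json()
        for event in events:
            # The validation handshake must succeed, or the subscription
            # cannot be created in the first place.
            if event.get("eventType") == "Microsoft.EventGrid.SubscriptionValidationEvent":
                return jsonify({"validationResponse": event["data"]["validationCode"]})
        # Every real delivery fails, forcing retries and eventual dead-lettering.
        return "simulated failure", 500

    if __name__ == "__main__":
        app.run(port=5000)

If events reach the dead-letter container through this endpoint but not in production, the dead-letter configuration itself is not the problem.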

1 Answer


The current version of AEG is not integrated with the Diagnostic settings feature, which would help a great deal with streaming the metrics and logs.

For your scenario, which is based on Event Domains (still in public preview, see its limits), the Azure Monitor REST API can help; it lets you see all the metrics of your specific Event Domain.

The valid metrics are:

PublishSuccessCount, PublishFailCount, PublishSuccessLatencyInMs, MatchedEventCount, DeliveryAttemptFailCount, DeliverySuccessCount, DestinationProcessingDurationInMs, DroppedEventCount, DeadLetteredCount

The following example is a REST GET request to obtain all metric values within your event domain for a specific timespan and interval:

https://management.azure.com/subscriptions/{mySubId}/resourceGroups/{myRG}/providers/Microsoft.EventGrid/domains/{myDomain}/providers/Microsoft.Insights/metrics?api-version=2018-01-01&interval=PT1H&aggregation=count,total&timespan=2019-02-06T07:58:12Z/2019-02-07T08:58:12Z&metricnames=PublishSuccessCount,PublishFailCount,PublishSuccessLatencyInMs,MatchedEventCount,DeliveryAttemptFailCount,DeliverySuccessCount,DestinationProcessingDurationInMs,DroppedEventCount,DeadLetteredCount
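
The same request can be issued from Python. This is a minimal sketch, assuming the azure-identity and requests packages are installed and the placeholders are filled in with your values:

    # A minimal sketch of pulling the Event Domain metrics via the GET above.
    import requests
    from azure.identity import DefaultAzureCredential

    SUB_ID, RG, DOMAIN = "<mySubId>", "<myRG>", "<myDomain>"  # placeholders

    token = DefaultAzureCredential().get_token("https://management.azure.com/.default")

    url = (f"https://management.azure.com/subscriptions/{SUB_ID}/resourceGroups/{RG}"
           f"/providers/Microsoft.EventGrid/domains/{DOMAIN}"
           f"/providers/Microsoft.Insights/metrics")
    params = {
        "api-version": "2018-01-01",
        "interval": "PT1H",
        "aggregation": "count,total",
        "timespan": "2019-02-06T07:58:12Z/2019-02-07T08:58:12Z",
        "metricnames": ("PublishSuccessCount,PublishFailCount,PublishSuccessLatencyInMs,"
                        "MatchedEventCount,DeliveryAttemptFailCount,DeliverySuccessCount,"
                        "DestinationProcessingDurationInMs,DroppedEventCount,DeadLetteredCount"),
    }
    resp = requests.get(url, params=params,
                        headers={"Authorization": f"Bearer {token.token}"})
    resp.raise_for_status()
    for metric in resp.json()["value"]:
        print(metric["name"]["value"], metric["timeseries"])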

Based on the response values, you can see the AEG behavior metrics both from the publisher side and for event delivery to the subscriber. For your production version, I recommend using a polling technique to obtain all metrics from AEG and push them to an Event Hub for stream analysis, alerting, etc. Based on the query parameters (such as timespan, interval, etc.), this can be close to real time. Once the Diagnostic settings feature is supported by AEG, this polling and publishing of metrics becomes obsolete, and only a small modification to the analyzing stream job is needed.
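
A sketch of that polling technique, assuming the azure-eventhub package; fetch_metrics() is a hypothetical helper wrapping the Azure Monitor GET request shown earlier, and the connection string and hub name are placeholders:

    # A minimal polling sketch that streams AEG metrics into an Event Hub.
    import json
    import time
    from azure.eventhub import EventHubProducerClient, EventData

    def fetch_metrics() -> dict:
        # Hypothetical helper: issue the GET from the previous sketch
        # and return resp.json().
        raise NotImplementedError

    producer = EventHubProducerClient.from_connection_string(
        "<EventHubConnectionString>", eventhub_name="<metricsHub>")  # placeholders

    while True:
        batch = producer.create_batch()
        batch.add(EventData(json.dumps(fetch_metrics())))
        producer.send_batch(batch)   # downstream stream job analyzes/alerts
        time.sleep(60)               # a shorter interval gets closer to real time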

The other point is to extend your eventing model with an auditing part. I recommend the following:

  1. Add a domain-scope subscription to capture all events in the event domain and push them to an Event Hub for streaming purposes (see the sketch after this list). Note that every event published within that event domain should appear in this stream pipeline.

  2. Add a storage subscription for dead-letter messages and push them to the same Event Hub for streaming purposes.

  3. (optional) Add the Diagnostic settings (some metrics) of the dead-letter storage to the same Event Hub for streaming purposes. Note that a dead-letter message is dropped after 4 hours of trying to store it in the blob container. There is no log message for that failed process, only a metric counter.
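
As a sketch of item 1, with the dead-letter destination attached that item 2 then drains from storage: this assumes the azure-mgmt-eventgrid and azure-identity packages; the model and method names are from that SDK but can vary across versions, and all resource IDs are placeholders.

    # A minimal sketch of a domain-scope subscription that fans all events
    # into an Event Hub and dead-letters failures into a blob container.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.eventgrid import EventGridManagementClient
    from azure.mgmt.eventgrid.models import (
        EventSubscription, EventHubEventSubscriptionDestination,
        StorageBlobDeadLetterDestination)

    client = EventGridManagementClient(DefaultAzureCredential(), "<mySubId>")

    domain_scope = ("/subscriptions/<mySubId>/resourceGroups/<myRG>"
                    "/providers/Microsoft.EventGrid/domains/<myDomain>")

    subscription = EventSubscription(
        destination=EventHubEventSubscriptionDestination(
            resource_id="<eventHubResourceId>"),          # the auditing Event Hub
        dead_letter_destination=StorageBlobDeadLetterDestination(
            resource_id="<storageAccountResourceId>",
            blob_container_name="deadletters"))

    # Scoped to the domain itself, so it matches every event in every domain topic.
    client.event_subscriptions.begin_create_or_update(
        domain_scope, "audit-all-events", subscription).result()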

On the consumer side, I recommend that each subscriber create a log message (aeg headers + event message) for auditing and troubleshooting purposes. It should be stored in a blob container or locally and then uploaded, etc. The point is that this reference can be very useful for the analyzing stream job to quickly figure out where the problem is.
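
A minimal sketch of such a subscriber-side audit log, again assuming Flask; it records the aeg-* delivery headers together with the event body before normal processing:

    # Subscriber webhook that logs aeg-* headers plus the event payload
    # for auditing before processing.
    import json
    import logging
    from flask import Flask, request, jsonify

    app = Flask(__name__)
    logging.basicConfig(filename="aeg-audit.log", level=logging.INFO)

    @app.route("/api/events", methods=["POST"])
    def handle_events():
        aeg_headers = {k: v for k, v in request.headers.items()
                       if k.lower().startswith("aeg-")}
        events = request.get_json()
        for event in events:
            if event.get("eventType") == "Microsoft.EventGrid.SubscriptionValidationEvent":
                return jsonify({"validationResponse": event["data"]["validationCode"]})
            # Audit record: delivery headers + full event, for cross-checking
            # against the publisher's stream when events appear to go missing.
            logging.info(json.dumps({"headers": aeg_headers, "event": event}))
            # ... normal event processing goes here ...
        return "", 200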

In addition to your eventing model, your publisher should periodically (for instance, once per hour) probe the event domain endpoint by sending a probe event message to a dedicated probe topic. The event subscription for that probe topic is configured with a dead-lettering option, and its subscriber webhook handler should always fail with an error code = HttpStatusCode.BadRequest, so that no retrying takes place. Note that there is a 300-second delay before the dead-letter message is stored in the storage; in other words, about 5 minutes after the probe event, the dead-letter message should be in the stream pipeline. This probe scenario exercises the AEG functionality from both the publisher and the delivery point of view.
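
A sketch of the publisher-side probe, assuming the domain endpoint and access key are available; the endpoint, key, topic, and event type are placeholders, and the always-failing handler can look like the 500-returning sketch in the comments above, with 400 instead of 500:

    # Hourly probe publisher. The probe subscription's webhook always returns
    # 400, so roughly 5 minutes later the probe should surface as a
    # dead-letter blob in the stream pipeline.
    import datetime
    import uuid
    import requests

    DOMAIN_ENDPOINT = "https://<myDomain>.<region>-1.eventgrid.azure.net/api/events"
    DOMAIN_KEY = "<domainAccessKey>"

    probe_event = [{
        "id": str(uuid.uuid4()),
        "topic": "probetopic",                 # dedicated probe topic in the domain
        "subject": "probe",
        "eventType": "Probe.HealthCheck",      # hypothetical event type
        "eventTime": datetime.datetime.utcnow().isoformat() + "Z",
        "data": {"note": "end-to-end probe"},
        "dataVersion": "1.0",
    }]

    resp = requests.post(DOMAIN_ENDPOINT, json=probe_event,
                         headers={"aeg-sas-key": DOMAIN_KEY})
    resp.raise_for_status()   # a publish failure here is itself a useful signal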

The solution described above is shown in the following screen snippet:

[screen snippet of the described solution omitted]