
We are seeing exceptionally strange behaviour on our consumption plan Function App with regard to the following exceptions, which we are seeing repeatedly:

  • Microsoft.Azure.EventHubs.ReceiverDisconnectedException (New receiver with higher epoch of '2' is created hence current receiver with epoch '1' is getting disconnected.)
  • System.Net.WebException (Exception of type 'Microsoft.ServiceBus.Messaging.LeaseLostException' was thrown.)

We get these exceptions whenever we stress the functions, i.e. go from 0 to 50,000 events in a matter of moments, but they are tagged with a cloud_role matching our Function App, which leads me to believe it is a host error.

Reading various docs (e.g. https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-features), I think I understand how the EventHub receiver is meant to work [though honestly I am reading between the lines, as it's quite unclear]: my single receiver relies on a consumer group to manage reading batches of messages from the EventHub partitions (of which I am using 32).

My hypothesis was that, under load, there were too many function instances for the single consumer group to 'cope' with, and it was simply switching out the partition leases over and over. However, in my testing scenario I removed all logic from the functions apart from relaying messages between event hubs, and the errors persisted even with only 4 partitions on the EventHub.
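
For reference, the stripped-down relay looked roughly like the sketch below (Functions v2 / C# syntax; the hub names and connection setting names here are placeholders, not my real ones):

```csharp
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.EventHubs;
using Microsoft.Azure.WebJobs;

public static class RelayFunction
{
    // "source-hub", "target-hub" and both connection setting names are placeholders.
    [FunctionName("Relay")]
    public static async Task Run(
        [EventHubTrigger("source-hub", Connection = "SourceEventHubConnection")] EventData[] events,
        [EventHub("target-hub", Connection = "TargetEventHubConnection")] IAsyncCollector<EventData> output)
    {
        // No business logic at all: copy each event body straight to the target hub.
        foreach (var e in events)
        {
            await output.AddAsync(new EventData(e.Body.ToArray()));
        }
    }
}
```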

In a desperate bid to see if it was resolved in later versions, I mocked up exactly the same functionality in Functions v2, and received what I assume are the .NET Core equivalents:

  1. Microsoft.Azure.EventHubs.ReceiverDisconnectedException (New receiver with higher epoch of '2' is created hence current receiver with epoch '1' is getting disconnected.)
  2. Microsoft.WindowsAzure.Storage.StorageException (The lease ID specified did not match the lease ID for the blob.)
  3. System.ArgumentOutOfRangeException (Ignoring out of date checkpoint with offset 1184072/sequence number 1038 because..)

So, can someone please:

  • explain what on earth is actually going on under the covers
  • help me to suppress these, if they are not 'real' errors and are just the host managing things...

These exceptions are really annoying because they make it quite tricky to see genuine unhandled exceptions.


1 Answer


These are spurious errors caused by the dynamic scale out/in of your Function living inside the Function App (host process), and you can ignore them.

Understandably, the fact that they are showing up in your logs is alarming, and we have begun some work on suppressing some of these errors (see https://github.com/Azure/azure-webjobs-sdk/issues/1760). This was released with version v1.0.11913, and you should now be seeing them as warnings. Kindly file an issue if they are still showing up as errors.
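
If they still clutter your Application Insights traces in the meantime, one option is to drop them on your side with a telemetry processor. The sketch below is just one possible approach, not something the Functions host provides out of the box; how you register an ITelemetryProcessor depends on your host version, and the type/message checks are assumptions you should adjust to what actually appears in your telemetry:

```csharp
using Microsoft.ApplicationInsights.Channel;
using Microsoft.ApplicationInsights.DataContracts;
using Microsoft.ApplicationInsights.Extensibility;

// Drops the known scale-out exceptions before they are sent to Application Insights.
// The exception-type and message checks are assumptions; adjust them to match the
// telemetry you actually see.
public class ScaleNoiseFilter : ITelemetryProcessor
{
    private readonly ITelemetryProcessor _next;

    public ScaleNoiseFilter(ITelemetryProcessor next) => _next = next;

    public void Process(ITelemetry item)
    {
        if (item is ExceptionTelemetry exceptionTelemetry)
        {
            var typeName = exceptionTelemetry.Exception?.GetType().FullName ?? string.Empty;
            var message = exceptionTelemetry.Exception?.Message ?? string.Empty;

            if (typeName.Contains("ReceiverDisconnectedException") ||
                message.Contains("LeaseLostException"))
            {
                return; // swallow the noisy host-scaling telemetry
            }
        }

        _next.Process(item);
    }
}
```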


Additional background on why you are seeing these exceptions

Let's start with some preliminaries on how EventHub scaling works, as noted in this post: https://stackoverflow.com/a/42911842/6465830

1. Microsoft.ServiceBus.Messaging.LeaseLostException

Each time a scale-out operation succeeds, EventHub redistributes the partition leases among the (1..N) group of EventProcessorHosts that successfully managed to get a lease on a partition, where N is the number of partitions for your EventHub. For instance, if you start with only Function_0 and it manages to grab a lease on all 10 partitions, then when we scale out to Function_1 and EventHub decides to evenly distribute messages between both Functions, Function_0 will lose the leases to 5 of the partitions. This behavior explains the "Exception of type 'Microsoft.ServiceBus.Messaging.LeaseLostException' was thrown" that you are seeing.
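
As a toy illustration of that arithmetic (this is not EventProcessorHost code, just the even-distribution calculation from the example above):

```csharp
using System;

class LeaseRedistribution
{
    static void Main()
    {
        // Illustrative numbers: 10 partitions, scaling from 1 host to 2 hosts.
        int partitions = 10;
        int hostsBefore = 1;
        int hostsAfter = 2;

        int ownedBefore = partitions / hostsBefore;                          // 10 leases
        int ownedAfter = (int)Math.Ceiling(partitions / (double)hostsAfter); // 5 leases
        int leasesLost = ownedBefore - ownedAfter;                           // 5 leases given up

        // Each lease Function_0 gives up surfaces as a LeaseLostException in its logs.
        Console.WriteLine($"Function_0 goes from {ownedBefore} to {ownedAfter} leases, losing {leasesLost}.");
    }
}
```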

2. Microsoft.Azure.EventHubs.ReceiverDisconnectedException

In addition, Azure Functions also scales out to more than N instances, so there will be a set of instances N+1..M, where M is the total number of scaled-out instances, that are not able to get a lease on any partition. The side effect of this is that there will always be an EPH ready to quickly grab a lost lease and keep the pipeline going. This explains the "New receiver with higher epoch of '2' is created hence current receiver with epoch '1' is getting disconnected." message that you are seeing. Again, you are only charged when your Function executes, so the fact that there is some over-provisioning here will not affect your billing.
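
If you want to see the epoch takeover in isolation, outside of the Functions host, something along these lines should reproduce it against a scratch Event Hub (the connection string, hub name and partition id are placeholders):

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.EventHubs;

class EpochDemo
{
    static async Task Main()
    {
        // Placeholder connection string; must include EntityPath for the hub.
        var client = EventHubClient.CreateFromConnectionString(
            "<event-hub-connection-string>;EntityPath=<hub-name>");

        // First receiver on partition "0" with epoch 1, as an EventProcessorHost would create.
        var receiverEpoch1 = client.CreateEpochReceiver("$Default", "0", EventPosition.FromStart(), 1);
        await receiverEpoch1.ReceiveAsync(10);

        // A second receiver with a higher epoch takes over the same partition...
        var receiverEpoch2 = client.CreateEpochReceiver("$Default", "0", EventPosition.FromStart(), 2);
        await receiverEpoch2.ReceiveAsync(10);

        // ...and the next operation on the epoch-1 receiver fails with
        // ReceiverDisconnectedException ("New receiver with higher epoch of '2' ...").
        try
        {
            await receiverEpoch1.ReceiveAsync(10);
        }
        catch (ReceiverDisconnectedException ex)
        {
            Console.WriteLine(ex.Message);
        }

        await client.CloseAsync();
    }
}
```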