So many Service Bus Transient Errors?

Question

We have two Windows services that run on a corporate on-premises server and continually send messages to Azure Service Bus in the cloud. Although the messages do eventually reach the service bus, there are long stretches during which they never seem to get through.

This is causing delays for us, as we depend on each message arriving on the service bus and being processed within a minute. However, as can be seen below, a message can be 'blocked' for up to 30-40 minutes before making its way through to Azure Service Bus. This happens every day, at some point during almost every hour.

The errors are mainly one of the following (example logs at end of this post):

A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 191.239.XX.XXX:443
Error during communication with Service Bus. Check the connection information, then retry.
No such host is known
The request operation did not complete within the allotted timeout of 00:01:10. The time allotted to this operation may have been a portion of a longer timeout. TrackingId:f2db6377-e17d-401a-b339-11fbb51c7bf7, Timestamp:19/05/2017 12:47:36 AM

We send messages to the service bus as follows (simplified):

private TopicClient _azureTopic;

...
<Begin Loop>

if (_azureTopic == null)
{
   var connectionString = "Endpoint=sb://mynamespace.servicebus.windows.net/;SharedAccessKeyName=managerfiddev;SharedAccessKey=AABBCCDDEEFFGGHHHASDFADFAadfadfdfz=;EntityPath=mytopic";
   _azureTopic = TopicClient.CreateFromConnectionString(connectionString);
   _azureTopic.RetryPolicy = RetryPolicy.NoRetry;
}

var brokeredMessage = new BrokeredMessage(message.Message)
{
    MessageId = message.Id.ToString()
};
brokeredMessage.Properties["ReceivedTimestamp"] = DateTime.Now;

_azureTopic.Send(brokeredMessage);

<End Loop>

Note: There is a deliberate reason why we have a NoRetry policy. Without wanting to add too much noise to the question, the same message that failed will be tried again in the next iteration (it sends the message to subscribers in a round robin fashion).

Example log of errors during a small window of time.

20:31:51 Event.WindowsService Event.WindowsService::PublishAzureServiceBusTopicMessage() error trying to synchronise message with Azure. Message ID: 1191251
Error during communication with Service Bus. Check the connection information, then retry.

20:32:00 Event.WindowsService Event.WindowsService::PublishAzureServiceBusTopicMessage() error trying to synchronise message with Azure. Message ID: 1191251
No such host is known

20:32:00 RFID.WindowsService RFID.WindowsService::PublishAzureServiceBusTopicMessage() error trying to synchronise message with Azure. Message ID: 1930029
No such host is known

20:32:10 RFID.WindowsService RFID.WindowsService::PublishAzureServiceBusTopicMessage() error trying to synchronise message with Azure. Message ID: 1930029
No such host is known

20:32:10 Event.WindowsService Event.WindowsService::PublishAzureServiceBusTopicMessage() error trying to synchronise message with Azure. Message ID: 1191251
No such host is known

20:32:10 RFID.WindowsService RFID.WindowsService::PublishAzureServiceBusTopicMessage() error trying to synchronise message with Azure. Message ID: 1930029
No such host is known

20:34:00 RFID.WindowsService RFID.WindowsService::PublishAzureServiceBusTopicMessage() error trying to synchronise message with Azure. Message ID: 1930034
Error during communication with Service Bus. Check the connection information, then retry.

20:38:34 Event.WindowsService Event.WindowsService::PublishAzureServiceBusTopicMessage() error trying to synchronise message with Azure. Message ID: 1191269
Error during communication with Service Bus. Check the connection information, then retry.

20:38:51 RFID.WindowsService RFID.WindowsService::PublishAzureServiceBusTopicMessage() error trying to synchronise message with Azure. Message ID: 1930043
Error during communication with Service Bus. Check the connection information, then retry.

Unfortunately, we can't afford to wait on a retry policy. If a message fails, we need to send it to a second on-premise subscriber (and then later return to trying the first Azure subscriber again). — Stefan Zvonar
You can configure your client to retry only once and then send it on to the on-premise subscriber. What is the underlying reason not to have a retry policy? — Thomas
We don't want to retry because we can't afford to wait before sending the same message to our second subscriber. In essence, the loop sends to Azure and then sends to another WCF service on premise. If it fails to send to Azure, it must still go on to send to the WCF service (without any delay). We keep track of what has been sent to Azure and WCF separately, so when the loop goes back to sending to Azure, it will retry the same message that it had failed on. — Stefan Zvonar
The retry policy only applies to transient faults, so the message won't be delivered twice. You can also configure the delay between retries. — Thomas

Thomas · Accepted Answer · 2017-05-19T07:23:32

Service Bus has native retry capabilities on the NamespaceManager, MessagingFactory, and client classes (see "Retry guidance for specific services").

Because the retry policy only handles transient exceptions, retries should not result in duplicate messages being sent.

If you want to retry only once, you can configure it like this:

var connectionString = "myconnectionstring";
var client = TopicClient.CreateFromConnectionString(connectionString);
client.RetryPolicy = new RetryExponential(minBackoff: TimeSpan.FromSeconds(2),
                                        maxBackoff: TimeSpan.FromSeconds(2),
                                        maxRetryCount: 1);

This should do the trick.
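
To tie this back to the loop in the question, here is a minimal sketch of how a single-retry policy could sit alongside the "move on to the next subscriber" behaviour described in the comments. It assumes the same WindowsAzure.ServiceBus SDK as the question; the `AzurePublisher` class and `TrySendToAzure` helper are hypothetical names made up for illustration, not part of the SDK.

```csharp
using System;
using Microsoft.ServiceBus;            // RetryExponential
using Microsoft.ServiceBus.Messaging;  // TopicClient, BrokeredMessage, MessagingException

public static class AzurePublisher
{
    // Hypothetical helper: creates a client that retries a transient failure once.
    public static TopicClient CreateClient(string connectionString)
    {
        var client = TopicClient.CreateFromConnectionString(connectionString);
        client.RetryPolicy = new RetryExponential(minBackoff: TimeSpan.FromSeconds(2),
                                                  maxBackoff: TimeSpan.FromSeconds(2),
                                                  maxRetryCount: 1);
        return client;
    }

    // Hypothetical helper: returns true if the send succeeded, false if the
    // caller should carry on and try this message again in a later iteration.
    public static bool TrySendToAzure(TopicClient client, object body, string messageId)
    {
        try
        {
            // Build a fresh message per attempt and reuse the same MessageId.
            using (var message = new BrokeredMessage(body) { MessageId = messageId })
            {
                message.Properties["ReceivedTimestamp"] = DateTime.Now;
                client.Send(message);
            }
            return true;
        }
        catch (MessagingException)
        {
            // Communication-level failure that survived the single retry.
            return false;
        }
        catch (TimeoutException)
        {
            // The operation did not complete within the client's timeout.
            return false;
        }
    }
}
```

Inside the loop you would call `TrySendToAzure`, record the result in whatever tracking you already keep for Azure, and then go on to the WCF subscriber regardless of the outcome.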

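If you want to guarantee deduplication, look up Azure Service Bus duplicate detection: when it is enabled on the topic, Service Bus drops any message whose MessageId it has already seen within a configurable time window.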
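
As a rough sketch of what enabling this could look like with the same SDK: the `TopicSetup` class, the method name, and the 10-minute window below are illustrative assumptions. The sketch also assumes a namespace-level connection string with Manage rights, which is not the same as the send-only, entity-scoped connection string shown in the question.

```csharp
using System;
using Microsoft.ServiceBus;            // NamespaceManager
using Microsoft.ServiceBus.Messaging;  // TopicDescription

public static class TopicSetup
{
    // Hypothetical helper: creates the topic with duplicate detection switched on.
    // An existing topic is left alone, so this only takes effect for a new topic.
    public static void EnsureTopicWithDuplicateDetection(string namespaceConnectionString,
                                                         string topicPath)
    {
        // Assumption: this connection string carries Manage rights on the namespace.
        var namespaceManager = NamespaceManager.CreateFromConnectionString(namespaceConnectionString);

        if (namespaceManager.TopicExists(topicPath))
        {
            return;
        }

        var description = new TopicDescription(topicPath)
        {
            RequiresDuplicateDetection = true,
            // Illustrative value: how long Service Bus remembers MessageIds.
            DuplicateDetectionHistoryTimeWindow = TimeSpan.FromMinutes(10)
        };

        namespaceManager.CreateTopic(description);
    }
}
```

Since the question's code already sets `MessageId` from `message.Id`, a resend of the same message within the window would be treated as a duplicate.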
