3
votes

I've recently been reading up on messaging systems and have specifically looked at both RabbitMQ and NServiceBus. As I have understood it, if a message fails for some reason it is tried again immidiately a number of times. Both systems then offers the possibility to try again later, for example in 5 seconds. When the five seconds have passed the message is sent again a number of times.

I quote Vaughn Vernon in Implementing Domain-Driven Design (p.502):

The other way to handle this is to simply retry the send until it succeeds, perhaps using a Capped Exponential Back-off. In the case of RabbitMQ, retries could fail for quite a while. Thus, using a combination of message NAKs and retries could be the best approach. Still, if our process retries three times every five minutes, it could be all we need.

For NServiceBus, this is called second level retries, and when the retry happens, it happens multiple times.

Why does it need to happen multiple times? Why does it not retry once every five minutes? What is the chance that the first retry after five minutes fails and the second retry, probably just milliseconds later, should succeed?

And in case it does not need to due to some configuration (does it?), why do all the examples I have found have multiple retries?

2

2 Answers

5
votes

My background is NServiceBus so my answer may be couched in those terms.

First level retries are great for very transient errors. Deadlocks are a perfect example of this. You try to change the database, and your transaction is chosen as the deadlock victim. In these cases, a first level retry is perfect. Most of the time, one first level retry is all you need. If there is a lot of contention in the database, maybe 2 or 3 retries will be good enough.

Second level retries are for your less transient errors. Think about things like a web service being down for 10 seconds, or a SQL Server database in a failover cluster switching over, which can take 30-60 seconds. If you retry a few milliseconds later, it's not going to do you any good, but 10, 20, 30 seconds later you might have a good shot.

However, the crux of the question is after 5 first level retries and then a delay, why try again 5 times before an additional delay?

First, on your first second-level retry, it's still possible that you could get a deadlock or other very transient error. After all, the goal is usually not to make as slow a system as possible so it would be preferable to not have to wait an additional delay before retrying if the problem is truly transient. Of course there's no way for the infrastructure to know just how transient the problem is.

The second reason is that it's just easier to configure if they're all the same. X levels of retry and Y tries per level = X*Y total tries and only 2 numbers in the configuration file. In NServiceBus, it's these 2 values plus the back-off time span, so the config looks like this:

<SecondLevelRetriesConfigEnabled="true" TimeIncrease ="00:00:10" NumberOfRetries="3" />
<TransportConfig MaxRetries="3" />

That's fairly simple. Try 3 times. Wait 10 seconds. Try 3 times. Wait 20 seconds. Try 3 times. Wait 30 seconds. Try 3 times. Then you're done and you move on to an error queue.

Configuring different values for each level would require a much more complex config story.

3
votes

First Level Retries exist to compensate for quick issues like networking and database locks. This is configurable in NSB, so if you don't want them, you can turn them off. Second Level Retries are to compensate for longer outages. For example we use SLRs to compensate for a database that recycles every night at the same time.

The OOTB functionality increases the duration between SLRs because it assumes that if it didn't work the previous time, you will need more time to fix it. There exists a Retry Policy that is overridable, so you can change how the SLRs work.

In NSB, the FLRs always come first and SLRs don't come into play unless the transaction is still failing after FLRs. In addition, you can disable SLRs altogether and build your own custom Fault Manager which have additionally functionality. We have a process where we have a Fault Manager that sends issues to a staffed help desk, as that is the only way to solve a particular subset of issues.