
I want to save a domain event as a document in DocumentDB and then publish that event on the Azure Service Bus (ASB) for other services to pick up. I want these two actions to run in one transaction, so that if one of them fails, the other is rolled back automatically.

ASB provides transaction support for the incoming and outgoing (downstream) messages in a handler, i.e. those messages will not be sent if the handler fails AND the original message received in the handler will not be removed from the bus. But what about any data saved in the same handler, like in my case storing the event in DocumentDB? How can this be included in that transaction?


1 Answer


Three thoughts:

  1. Make one side of the whole transaction idempotent. Do the other side first and, if it succeeds, try (and retry) the idempotent side until it succeeds too. Idempotency means that you can redo a particular operation many times and the effect will be the same as if you did it once. So, keep retrying until it passes. This is how my service bus and DocumentDB systems are designed.

  2. Remove the service bus from the design for these types of actions and use DocumentDB as your service bus. Then make sure you do all of the actions you want to be considered a single transaction in a single call to a stored procedure which will give it ACID all-pass or all-fail transaction guarantees. You may still be able to use the service bus for other things, just not these.

  3. Since the two systems are independent, the only other way I can think to do this is with compensating actions. For instance, you could try the DocumentDB side. If that involves more than one document or a read followed by a write of the same document(s), do that in a stored procedure which will give it ACID transaction guarantees. Be sure to compose the transaction in a way that you can reverse it (compensate for it). If the DocumentDB actions do not fail, then try the service bus action. If that fails, then execute the compensating transaction back on the DocumentDB.
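The compensating-action flow in option 3 can be sketched roughly as follows. This is a minimal illustration, not real DocumentDB or Service Bus client code: `save`, `publish`, and `compensate` are hypothetical callables standing in for the document write, the bus send, and the reversal of the write, and `InconsistentStateError` is a made-up name for the "red flag" case.

```python
class InconsistentStateError(Exception):
    """Hypothetical red flag: publish failed AND the compensation failed."""


def save_then_publish(event, save, publish, compensate, max_attempts=3):
    """Save `event`, then publish it; undo the save if publishing keeps failing.

    Returns True if both sides succeeded, False if the save was compensated.
    Raises InconsistentStateError if the save could not be reversed.
    """
    # If the save itself raises, nothing was published, so nothing to undo.
    save(event)

    for _ in range(max_attempts):
        try:
            publish(event)
            return True
        except Exception:
            # Sketch only: real code would catch transient errors specifically
            # and back off between attempts.
            continue

    # Publishing never succeeded: run the compensating action.
    try:
        compensate(event)
    except Exception as exc:
        # The two systems now disagree; surface it loudly.
        raise InconsistentStateError(event["id"]) from exc
    return False


# In-memory stand-ins (assumptions for the demo only).
store, sent = {}, []

def save(event):
    store[event["id"]] = event          # upsert by id, so a redo is harmless

def compensate(event):
    store.pop(event["id"], None)        # reverse the save

def ok_publish(event):
    sent.append(event["id"])

def failing_publish(event):
    raise ConnectionError("bus unavailable")


print(save_then_publish({"id": "e1"}, save, ok_publish, compensate))
print(save_then_publish({"id": "e2"}, save, failing_publish, compensate))
print(sorted(store), sent)
```

Keying the stored document by the event's id is what makes the save safe to repeat, which is the same idempotency idea as in option 1.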

Note, this 3rd option is still not a perfect solution. If, during the time it takes for you to try the service bus actions, something has been done on the DocumentDB side that makes it impossible to reverse that aspect in a consistent way, you will now have an inconsistent state. If that occurs, be sure the system throws up a red flag that can't be ignored. It's up to you to model your system and determine how rare that is. Remember, living with exceedingly rare events is OK. Think of how GUIDs are composed. There is still a possibility that there will be a collision. It's just so rare that you don't need to worry about it.

Even if it's not "exceedingly" rare, you may still want to do it, depending upon how rare and how damaging inconsistency is. Let's say there's a 1 in 10,000 chance that the service bus action will fail and a 1 in 10,000 chance that the compensating transaction will fail. If the two failures are independent, the chance of both occurring is 1 in 100,000,000. If you do 1,000,000 of these a month, that is an expected 0.01 incidents a month: it'll take about 69 months on median for the first one to occur, and they will occur roughly 100 months apart on average. Then determine how bad that is. If it costs you $10,000 to fix one of these (paying a customer for an SLA violation plus labor to fix it manually), can you afford that expense once every 100 months?
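The arithmetic above can be checked in a few lines. This sketch assumes the two failures are independent and that incidents arrive at a constant average rate, so the wait for the first one is approximately exponential. Under that model the median wait is ln 2 (about 0.69) times the mean, not half of it. The variable names are illustrative.

```python
import math

p_publish_fails = 1 / 10_000        # chance the service bus action fails
p_compensation_fails = 1 / 10_000   # chance the compensating action fails
events_per_month = 1_000_000
cost_per_incident = 10_000          # dollars

# Both must fail for the state to end up inconsistent (assumes independence).
p_inconsistent = p_publish_fails * p_compensation_fails

incidents_per_month = p_inconsistent * events_per_month
mean_months_between = 1 / incidents_per_month
median_months_to_first = math.log(2) / incidents_per_month
expected_cost_per_year = incidents_per_month * 12 * cost_per_incident

print(f"chance per event:       1 in {1 / p_inconsistent:,.0f}")
print(f"mean months between:    {mean_months_between:.0f}")
print(f"median months to first: {median_months_to_first:.0f}")
print(f"expected cost per year: ${expected_cost_per_year:,.0f}")
```

Seen this way, $10,000 once every 100 months or so averages out to roughly $1,200 a year, which is the number to weigh against the cost of building something stronger.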

I would probably take the analysis a bit further and create a Monte Carlo simulation that tests the uncertainty in your model. You probably won't know that it's a 1 in 10,000 chance; you'll more likely say, "I'm 70% sure or 90% sure it's between 1/1,000 and 1/100,000." A Monte Carlo simulation will allow you to "roll the dice" in that range and produce a probability curve for the cost in, let's say, the first year. The output would say something like this: "There is a 10% chance that it'll cost us at least $100,000 in the first year, a 40% chance that it'll cost us at least $10,000 in the first year, and a 90% chance that it'll cost us at least $1,000 in the first year."
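A minimal sketch of such a simulation, using only the Python standard library, might look like this. Several modeling choices here are assumptions made for illustration: each failure probability is drawn log-uniformly between 1/100,000 and 1/1,000, the two failures are independent, the number of incidents in a year is Poisson-distributed, and each incident costs a flat $10,000. The percentages it prints come from those assumptions and will not match the illustrative figures quoted above.

```python
import math
import random


def sample_poisson(rng, lam):
    """Draw a Poisson(lam) count with Knuth's method (fine for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def simulate_first_year_costs(trials=20_000, events_per_month=1_000_000,
                              cost_per_incident=10_000, seed=0):
    """Return one simulated first-year cost (in dollars) per trial."""
    rng = random.Random(seed)
    costs = []
    for _ in range(trials):
        # "Roll the dice" on the uncertain failure probabilities.
        p_publish_fails = log_uniform(rng, 1e-5, 1e-3)
        p_compensation_fails = log_uniform(rng, 1e-5, 1e-3)
        expected_incidents = (p_publish_fails * p_compensation_fails
                              * events_per_month * 12)
        incidents = sample_poisson(rng, expected_incidents)
        costs.append(incidents * cost_per_incident)
    return costs


def chance_cost_at_least(costs, threshold):
    """Fraction of trials whose cost is at least `threshold`."""
    return sum(cost >= threshold for cost in costs) / len(costs)


costs = simulate_first_year_costs()
for threshold in (10_000, 50_000, 100_000):
    share = chance_cost_at_least(costs, threshold)
    print(f"chance of at least ${threshold:,} in year one: {share:.1%}")
```

Swapping in a different distribution for the failure probabilities, or a range of repair costs instead of a flat figure, only changes the sampling lines inside the loop.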

Most engineers are not used to probabilistic decision making like this, but that's what a lot of my writing, speaking, and consulting is about. I've grown accustomed to helping engineering organizations learn to make decisions this way, mostly in how they model load and in how they forecast when a particular scope will be complete, but on occasion I do models like the one described above. The results give the team much higher confidence in their decision.