Logic exception in a buggy incoherent stream of events in DDD, CQRS, EventSourcing?

Question

Say you approach DDD with EventSourcing.

We all know events are immutable, and they should never be deleted from the event log. But what if the stream is logically "incorrect"? I don't mean the classic case of "I added money that I shouldn't have added, so I create a compensating event to withdraw it".

I'm not talking about runtime exceptions but about logical exceptions you might find in the event stream because developers introduced bugs in the event writers.

Question

How do you "replay" an event stream if the software that wrote it contained bugs that violated the domain logic?

Oookay... We all know that "that should never have happened" and "fire the coders that wrote those event writers" and so on...

But let's assume that the event stream is just there and you are rebuilding the projections by replaying the whole stream. It simply happened, and you are told to rebuild the projections from the existing event stream.

And suddenly, when replaying the event stream, you find "incoherent" events that fit neither the current business rules nor the rules that existed at the time.

Example 1

You have these events:

#  TimeStamp  Event          Data
------------------------------------------------------
1  03/jul     car.created    { id: 4444, color: blue }
2  14/jul     car.delivered  { id: 4444, to: Alice }
3  18/jul     car.created    { id: 5555, color: blue }
4  22/jul     car.created    { id: 5566, color: orange }
5  25/jul     car.created    { id: 5577, color: blue }

On 26/jul someone asks: "How many blue cars do you have in stock?".

Crystal clear: 2 units (ids 5555 and 5577).

Reason: Unit 4444 was sold. Unit 5566 is orange.
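
The reasoning above can be written down as a tiny projection. This is only a minimal sketch in Python: the tuple layout and the function name `blue_cars_in_stock` are assumptions made for illustration, not part of any event store API.

```python
# Hypothetical in-memory copy of the event log from Example 1:
# (sequence number, timestamp, event name, data)
EVENTS = [
    (1, "03/jul", "car.created", {"id": 4444, "color": "blue"}),
    (2, "14/jul", "car.delivered", {"id": 4444, "to": "Alice"}),
    (3, "18/jul", "car.created", {"id": 5555, "color": "blue"}),
    (4, "22/jul", "car.created", {"id": 5566, "color": "orange"}),
    (5, "25/jul", "car.created", {"id": 5577, "color": "blue"}),
]


def blue_cars_in_stock(events):
    """Replay the events and return the sorted ids of blue cars still in stock."""
    stock = {}  # car id -> color
    for _seq, _ts, name, data in events:
        if name == "car.created":
            stock[data["id"]] = data["color"]
        elif name == "car.delivered":
            stock.pop(data["id"], None)  # a delivered car leaves the stock
    return sorted(car_id for car_id, color in stock.items() if color == "blue")


print(blue_cars_in_stock(EVENTS))  # [5555, 5577]
```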

But what if you have this buggy sequence?

#  TimeStamp  Event          Data
------------------------------------------------------
1  03/jul     car.created    { id: 4444, color: blue }
2  14/jul     car.delivered  { id: 4444, to: Alice }
3  18/jul     car.created    { id: 5555, color: blue }
4  22/jul     car.created    { id: 5566, color: orange }
5  23/jul     car.created    { id: 5555, color: red }
6  25/jul     car.created    { id: 5577, color: blue }

Of course, event 5 should never have happened: you cannot create the same unit twice.

After consulting the domain experts, you discover event 5 is incorrect. It should read "car.repainted", but the software was buggy and wrote a "car.created".

Question for example 1:

Would you add new events numbered 7 and up, timestamped "just after" event 5, as some kind of compensation? Which events would you write?
Would you add new events numbered 7 and up, timestamped "just before" event 5, as some kind of signal telling the replayer "hey, ignore the next creation"? Which events would you write?
Would you rewrite your "replayers" so they interpret any "double creation" before 25/jul as "car.repainted", and re-run the replayers to rebuild the aggregates?
Would you violate golden rules and "touch" the history? In fact it's not "history", because event 5 did not really happen. Can we touch it then?
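
To make the third option (rewriting the replayers) concrete, here is a minimal Python sketch. The rule "a second car.created for an already known id is really a car.repainted" comes from the example; the tuple layout, the function name and the `seen` set are assumptions made for illustration.

```python
# Hypothetical in-memory copy of the buggy event log from Example 1.
BUGGY_EVENTS = [
    (1, "03/jul", "car.created", {"id": 4444, "color": "blue"}),
    (2, "14/jul", "car.delivered", {"id": 4444, "to": "Alice"}),
    (3, "18/jul", "car.created", {"id": 5555, "color": "blue"}),
    (4, "22/jul", "car.created", {"id": 5566, "color": "orange"}),
    (5, "23/jul", "car.created", {"id": 5555, "color": "red"}),  # buggy event
    (6, "25/jul", "car.created", {"id": 5577, "color": "blue"}),
]


def blue_cars_in_stock(events):
    """Replay with a healing rule for duplicate creations."""
    seen = set()  # every car id that was ever created
    stock = {}    # car id -> color, for cars currently in stock
    for _seq, _ts, name, data in events:
        # Healing rule: a creation of an already known id is read as a repaint.
        if name == "car.created" and data["id"] in seen:
            name = "car.repainted"
        if name == "car.created":
            seen.add(data["id"])
            stock[data["id"]] = data["color"]
        elif name == "car.repainted":
            if data["id"] in stock:
                stock[data["id"]] = data["color"]
        elif name == "car.delivered":
            stock.pop(data["id"], None)
    return sorted(car_id for car_id, color in stock.items() if color == "blue")


print(blue_cars_in_stock(BUGGY_EVENTS))  # [5577]
```

The stored events are left untouched; only their interpretation changes. A real rule would probably also be limited to the period in which the buggy writer was deployed, as the question suggests.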

Example 2

Let's assume a warehouse with a forklift to pick up things from shelves. The warehouse contains 2 vertical corridors, 2 horizontal corridors and 1 diagonal corridor.

All corridors are bidirectional, with two exceptions: the left vertical one has steps of some kind, so the forklift can only move from A to C and not the reverse; and the bottom horizontal one also has steps, so the forklift can only move from D to C and never from C to D.

After purchasing the machine, you start every day at spot A, as the entry door to the warehouse is there. How it gets back does not matter for this example: at the end of the day the forklift just disappears.

The commands can be:

purchase()
start()
goRight()
goLeft()
goUp()
goDown()
cross()

The events can be:

purchased
started
wentRight
wentLeft
wentUp
wentDown
crossed

This is the possible state diagram of the forklift aggregate:

Let's assume you are replaying the events of the aggregate and you find these:

#   TimeStamp     Event
----------------------------------------------
1   12/jul 10:00  purchased
2   14/jul 09:00  started
3   14/jul 11:00  wentDown
4   14/jul 12:00  crossed
5   14/jul 14:00  wentDown
6   23/jul 09:00  started
7   23/jul 10:00  wentRight
8   23/jul 13:00  crossed

Someone asks "Where's the forklift now?" You can easily tell: "C".

Reason: It doesn't matter what happened before event 6, because event 6 resets the position to A, event 7 moves to B, and event 8 moves to C.

But what if the sequence continues like this?

#   TimeStamp     Event
----------------------------------------------
[...]
6   23/jul 09:00  started
7   23/jul 10:00  wentRight
8   23/jul 13:00  crossed
9   23/jul 15:00  wentUp
10  23/jul 16:00  wentRight
11  27/jul 09:00  started
12  27/jul 11:00  wentDown

Some domain expert asks you "Hey geek guy, you told us eventsourcing was magical: Where was the forklift on 23/jul at 18:00?"

We all know that the forklift cannot "jump" over the steps, so we all know that event 9 could never have happened.

So our "replayers" cannot do anything other than throw an exception. But the event sequence that was already written is that one.
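
A strict replayer for the forklift could look like the Python sketch below. The transition table is an assumption inferred from the corridor description and from the event walkthrough (A top-left, B top-right, C bottom-left, D bottom-right, diagonal between B and C); the names `MOVES`, `replay` and `IncoherentEventError` are hypothetical.

```python
class IncoherentEventError(Exception):
    """Raised when an event is not allowed from the current position."""


# (current position, event) -> new position. Assumed layout, see above.
MOVES = {
    ("A", "wentRight"): "B",
    ("B", "wentLeft"): "A",
    ("A", "wentDown"): "C",   # left corridor is one-way: A -> C only
    ("B", "wentDown"): "D",
    ("D", "wentUp"): "B",
    ("D", "wentLeft"): "C",   # bottom corridor is one-way: D -> C only
    ("B", "crossed"): "C",
    ("C", "crossed"): "B",
}


def replay(events):
    """Replay (sequence number, event name) pairs and return the final position."""
    position = None
    for seq, name in events:
        if name == "purchased":
            position = None
        elif name == "started":
            position = "A"  # every day starts at the entry door
        elif (position, name) in MOVES:
            position = MOVES[(position, name)]
        else:
            raise IncoherentEventError(
                f"event {seq}: {name} is impossible from {position}"
            )
    return position


EVENTS = [
    (1, "purchased"), (2, "started"), (3, "wentDown"), (4, "crossed"),
    (5, "wentDown"), (6, "started"), (7, "wentRight"), (8, "crossed"),
]
print(replay(EVENTS))  # C

try:
    replay(EVENTS + [(9, "wentUp")])
except IncoherentEventError as err:
    print(err)  # event 9: wentUp is impossible from C
```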

The topic here is not "how to write a good sequence" but "what to do when you face a sequence with exceptions".

Questions for example 2:

Would you write a compensating event? How? Which? When?
Would you rewrite history? (ugly if you have millions of events)
How would you handle that exception from the point of view of the domain event replayers?

Hmm... so I'm actually wondering whether it may be an idea to have some status on an event in order to invalidate it but keep it around. That way it won't be processed. One may even go as far as having system events of some sort that can "replace" events or do some other maintenance, and have those events track such changes. I'm going to give that some thought! In your first case event 5 would be invalidated and a new event 5 inserted. The relevant projection(s) would then need to be rebuilt, or else "updated". Such a status would be an event store mechanism though. — Eben Roux
On second thought, another option may be to move the inconsistent events out to a separate store (a table, in DB speak) and just replace them. Then rebuild the relevant projections. — Eben Roux
During the last year I've been wondering whether we should have something like a "Log_xxx_Events" table where the writer writes, and another "Cache_xxx_Events" table that just catches up with the log as fast as possible, where you dump the log but correct it along the way. Not only this kind of "logic exception" but also formatting issues. Say that some events have UTF-8 chars converted to \uxxx while in other events they are not... Maybe it's a "temporal space" where the history is made ready to be read, and then you rebuild aggregates from this "readable history". These are just internal thoughts... But I'm unsure. — Xavi Montero
That's actually a common problem when systems are not the book of record. I think there should always be 2 concepts: the system event order and the business event order. Aggregates should always be rehydrated based on the business event order. Having two events with the same business ordering should allow shadowing previous events (where the last system-written event wins). The business ordering should be a double so that you can insert events in between other events. If event stores supported something like that, these kinds of problems would be easy to solve. Food for thought... — plalx
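
The last comment's idea of a separate business ordering with shadowing can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `StoredEvent`, `system_seq`, `business_order` and `business_view` are hypothetical names, and no existing event store is claimed to work this way.

```python
from dataclasses import dataclass


@dataclass
class StoredEvent:
    system_seq: int        # order in which the event was physically written
    business_order: float  # position in the business timeline (a double)
    name: str
    data: dict


def business_view(events):
    """Return events in business order; on a tie, the last system-written event wins."""
    latest = {}
    for ev in sorted(events, key=lambda e: e.system_seq):
        latest[ev.business_order] = ev  # later writes shadow earlier ones
    return [latest[order] for order in sorted(latest)]


LOG = [
    StoredEvent(1, 1.0, "car.created", {"id": 4444, "color": "blue"}),
    StoredEvent(2, 2.0, "car.delivered", {"id": 4444, "to": "Alice"}),
    StoredEvent(3, 3.0, "car.created", {"id": 5555, "color": "blue"}),
    StoredEvent(4, 4.0, "car.created", {"id": 5566, "color": "orange"}),
    StoredEvent(5, 5.0, "car.created", {"id": 5555, "color": "red"}),    # buggy
    StoredEvent(6, 6.0, "car.created", {"id": 5577, "color": "blue"}),
    StoredEvent(7, 5.0, "car.repainted", {"id": 5555, "color": "red"}),  # shadows #5
]

print([ev.name for ev in business_view(LOG)][4])  # car.repainted
```

Nothing is deleted: the buggy event with system sequence 5 stays in the log, but rehydration only sees the correction written later at the same business position.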

Constantin Galbenu · Accepted Answer · 2018-08-31T06:14:25

How do you "replay" an event stream if the software that wrote it contained bugs that violated the domain logic?

You have at least two options:

1. Put some healing code in the Aggregate's apply method or in the event subscribers (Readmodels, projections, Sagas); this piece of code should handle the exact situation that you are trying to avoid.

It has the disadvantage that it will exist forever in the code base, but it has the advantage that it can be done with zero downtime.

2. Migrate the Event store. Greg Young has a book about how to do it. Basically you create another event stream, possibly on another Event store instance; you process every event from the old event stream, repair the anomaly and append the result to the new event stream. After the migration is done, you replace the old event store with the new one.

It has the disadvantage that you probably need some downtime when replacing the event stores, but it has the advantage that after it is done you can "forget" the mistake: you will have a clean/correct event stream.
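
The migration option can be sketched as a copy-and-repair loop. The Python below is a minimal illustration using the first example from the question; the dict layout and the names `migrate` and `repair_duplicate_creation` are assumptions, and a real migration would read from and append to actual event store instances instead of lists.

```python
# Hypothetical in-memory copy of the buggy stream from Example 1.
OLD_STREAM = [
    {"seq": 1, "name": "car.created", "data": {"id": 4444, "color": "blue"}},
    {"seq": 2, "name": "car.delivered", "data": {"id": 4444, "to": "Alice"}},
    {"seq": 3, "name": "car.created", "data": {"id": 5555, "color": "blue"}},
    {"seq": 4, "name": "car.created", "data": {"id": 5566, "color": "orange"}},
    {"seq": 5, "name": "car.created", "data": {"id": 5555, "color": "red"}},
    {"seq": 6, "name": "car.created", "data": {"id": 5577, "color": "blue"}},
]


def repair_duplicate_creation(event, already_copied):
    """Return the list of events to append in place of `event`."""
    created_ids = {
        e["data"]["id"] for e in already_copied if e["name"] == "car.created"
    }
    if event["name"] == "car.created" and event["data"]["id"] in created_ids:
        # Known anomaly: the writer emitted car.created instead of car.repainted.
        return [{**event, "name": "car.repainted"}]
    return [event]


def migrate(old_stream, repair):
    """Build a new stream by passing every old event through `repair`."""
    new_stream = []
    for event in old_stream:
        new_stream.extend(repair(event, new_stream))
    return new_stream


NEW_STREAM = migrate(OLD_STREAM, repair_duplicate_creation)
print(NEW_STREAM[4]["name"])  # car.repainted
print(OLD_STREAM[4]["name"])  # car.created
```

Because `repair` returns a list, the same loop can also drop an event or split it into several. The old stream is never modified, so it can be archived as it was.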

Would you write a compensating event? How? Which? When?

Writing compensating events is handy when you want a fast solution; this is a particular case of solution no. 1.

Would you violate golden rules and "touch" the history? In fact it's not "history" because event "5" did not really happen. Can we touch it then?

You can do that. I certainly did, because I wanted the fastest solution possible, but it can get ugly depending on the framework/technology. For example, the subscribers can no longer be sure that they have processed all the relevant events from the event store, so, to be sure, you need to rebuild all Readmodels. The biggest problems will be with Sagas, because processing events there has side effects.

Regarding the legal aspect, if you do modify the history you would need to archive the old event stream just in case someone asks for it. This depends on your domain.