
When using event sourcing with aggregates as the transaction scope, you'd obviously prefer to have each aggregate on a single machine. But if you also want to build a highly available and horizontally scalable system, you'd want to replicate this state across many machines with different databases.

If only one write side on one machine in the network is allowed at any given moment, the other machines can be eventually consistent read sides. But to maximize write performance, I guess it would be better to allow multiple write sides at the same time. How are consistency and consensus handled in a system like this?

When two or more machines want to update the common but replicated state concurrently, how do I make sure that the commands are handled by all write sides, and in the same order, so that the resulting events are identical and also have the same order? Is a Lamport clock part of the solution?


2 Answers

0 votes

But to maximize write performance I guess it would be better to allow multiple write sides at the same time. But how is consistency and consensus handled in a system like this?

In an event-sourced system the consistency on the write side is always strong. This is enforced by the aggregates and the event store using optimistic locking: in case of a concurrent write (in fact, events are only ever appended to the store) the whole command is retried. This is possible because aggregate command methods are pure (side-effect free) methods. As long as the events are not yet persisted, the command can be retried.
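
As a rough illustration of that optimistic locking, here is a minimal in-memory sketch. The class, exception and method names are made up for this example and are not taken from any particular event store:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal in-memory event store with optimistic locking on the aggregate version.
public class InMemoryEventStore {

    public static class ConcurrencyException extends RuntimeException {
        public ConcurrencyException(String message) { super(message); }
    }

    private final Map<String, List<Object>> streams = new ConcurrentHashMap<>();

    // Append only succeeds if the stream is still at the version the caller loaded.
    public synchronized void append(String aggregateId, long expectedVersion, List<Object> newEvents) {
        List<Object> stream = streams.computeIfAbsent(aggregateId, id -> new ArrayList<>());
        if (stream.size() != expectedVersion) {
            throw new ConcurrencyException(
                "Expected version " + expectedVersion + " but stream is at " + stream.size());
        }
        stream.addAll(newEvents);
    }

    public synchronized List<Object> load(String aggregateId) {
        return new ArrayList<>(streams.getOrDefault(aggregateId, List.of()));
    }
}
```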

When two or more machines update the state at the same time (which one to choose and persist?)

Both. The first command (there is always a first) generates events that are persisted to the store. The second command fails because of a low-level concurrency exception. It is then retried by loading and applying all previous events, including those generated by the first command. The second command then generates additional events that are also persisted, or it throws an exception if the new state does not permit the second command to be handled.

You must notice that the second command is executed at least twice, but each time the previous events (and thus the state) are different.
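
A sketch of that retry loop could look like the following. It reuses the InMemoryEventStore sketch above, and rebuild/decide are placeholders standing in for your aggregate's event replay and command handling, not any real API:

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Function;

// Retry loop for a command: load events, rebuild state, decide, append, retry on conflict.
public class CommandRetrier {

    public static <S> void handleWithRetry(
            InMemoryEventStore store,
            String aggregateId,
            S initialState,
            BiFunction<S, Object, S> rebuild,     // state x event -> new state (event replay)
            Function<S, List<Object>> decide,     // state -> new events (the command itself)
            int maxAttempts) {

        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            List<Object> history = store.load(aggregateId);
            S state = initialState;
            for (Object event : history) {
                state = rebuild.apply(state, event);   // replay previous events, including the winner's
            }
            List<Object> newEvents = decide.apply(state); // may throw if the new state forbids the command
            try {
                store.append(aggregateId, history.size(), newEvents);
                return;                                 // persisted, done
            } catch (InMemoryEventStore.ConcurrencyException e) {
                // another command won the race; loop and retry against the fresh history
            }
        }
        throw new IllegalStateException("Command still conflicting after " + maxAttempts + " attempts");
    }
}
```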

The infrastructure keeps an aggregate version attached to each aggregate stream. Each appended event increases this version. There is a unique constraint on the aggregate id and version. This is probably how all event stores are implemented.
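
In a relational event store that constraint could look roughly like this; the table and column names are illustrative and not taken from any specific product. A violation of the unique constraint is what surfaces as the low-level concurrency exception mentioned above:

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch of a version-based uniqueness constraint in a relational event store.
public class EventStoreSchema {

    private static final String CREATE_EVENTS_TABLE =
        "CREATE TABLE events (" +
        "  aggregate_id  VARCHAR(64)  NOT NULL," +
        "  version       BIGINT       NOT NULL," +
        "  payload       TEXT         NOT NULL," +
        "  CONSTRAINT uq_stream_version UNIQUE (aggregate_id, version)" + // two writers cannot claim the same slot
        ")";

    public static void createSchema(Connection connection) throws SQLException {
        try (Statement statement = connection.createStatement()) {
            statement.execute(CREATE_EVENTS_TABLE);
        }
    }
}
```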

When a machine misbehaves (unknowingly or knowingly) and propagates faulty events to the rest of the network (how to detect this?)

I don't see how this could happen, but if it does happen then it really depends on your understanding of what a faulty event is. You could have some Sagas/Process managers that analyze the events and trigger emails to a supervisor of some kind.
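
If you go that route, such a process manager is just another subscriber on the event stream. A very rough sketch, where the Event type, the suspicion rule and the email sink are all hypothetical:

```java
import java.util.function.Consumer;
import java.util.function.Predicate;

// Sketch of a process manager that flags "suspicious" events and alerts a supervisor.
public class FaultyEventMonitor {

    public interface Event { String aggregateId(); }

    private final Predicate<Event> looksFaulty;
    private final Consumer<String> sendAlertEmail;

    public FaultyEventMonitor(Predicate<Event> looksFaulty, Consumer<String> sendAlertEmail) {
        this.looksFaulty = looksFaulty;
        this.sendAlertEmail = sendAlertEmail;
    }

    // Subscribed to the event stream like any other projection.
    public void on(Event event) {
        if (looksFaulty.test(event)) {
            sendAlertEmail.accept("Suspicious event on aggregate " + event.aggregateId());
        }
    }
}
```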

0 votes

The way I've handled this in my Shuttle.Recall (shameless plug) ES implementation is to build a unique clustered index on the aggregate id and the version in the event store. This way, multiple writes on the same AR can never overlap, and one of the two is going to "lose". Granted, this is only going to work with a central data store, but perhaps your implementation has a similar mechanism available.

There is no constraint on how many clients can write to the event store simultaneously. However, the projection processing has to be a single thread per named projection on a single machine, since the event ordering is that sensitive. Different projections can be processed on different machines, though.
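
As an illustration of the "single thread per named projection" point, here is a small sketch; the class and method names are made up and this is not Shuttle.Recall's actual API. Each projection gets its own queue and its own single-threaded worker, so events are applied strictly in order for that projection, while different projections run independently:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// One queue and one worker thread per named projection, preserving event order.
public class ProjectionProcessor {

    private final BlockingQueue<Object> queue = new LinkedBlockingQueue<>();
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    public ProjectionProcessor(Consumer<Object> projection) {
        worker.submit(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    projection.accept(queue.take()); // events applied one at a time, in order
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
    }

    // Called by whatever feeds events to this projection.
    public void enqueue(Object event) {
        queue.add(event);
    }

    public void shutdown() {
        worker.shutdownNow();
    }
}
```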