14
votes

What happens when leader crashes before all followers updates the commit index?

For example, node A, B, C forms the cluster:

  • only A and B alive and A is leader

  • A replicates an entry (let's say it's entry1) to B and get successful result from B

  • A commits entry1, and crashes before it send out the heartbeat message to B (which would results in B updates its commit index)

  • C online now

My questions is:

  • would C be elected as new leader? If so, then entry1 is lost? Moreover, if A re-joins later, its data would be inconsistent with others?

I know the raft spec said:

Raft uses a simpler approach where it guarantees that all the committed entries from previous terms are present on each new leader from the moment of its election, without the need to transfer those entries to the leader.

But here entry1 may be not considered as committed entry? because B has not get the confirm from old leader (heartbeat from leader). So C get chances to be new leader?

  • if B becomes new leader, then how it should deal with the entry1?
1

1 Answers

16
votes

It's important to note that an entry is considered committed once it's stored on a majority of servers in the cluster (there are technically some caveats to this but for this conversation we should assume this is the case) and not when a node receives a commit messages from the leader. If a commit message were required to consider an entry committed then every commit would require two round-trips - one for replication and one for commitment - and commit indexes would have to be persisted.

Instead, in your scenario, when A crashes and C recovers the Raft election algorithm will ensure that C cannot be elected leader and so C cannot drop the committed entry. Only B can be elected leader since it has the most up-to-date log. If C tries to get elected leader, it will receive only a rejected vote from B since B's log is more up-to-date than C's (it has the committed entry). Thus, what you'll see in practice is B will eventually be elected and will commit all entries from its prior term, at which time that entry will still be committed. Even if B were then to crash and A were to recover, A would still have a more up-to-date log than C and so it would again be elected leader.

When (not if) B becomes the leader, it will first ensure entries from the prior term are stored on a majority of servers before committing any entry from its current term. Typically this is done by committing a no-op entry at the beginning of the new leader's term. Essentially, the new leader commits a no-op entry, and once that entry is stored on a majority of servers it increments its commit index and sends the new commit index to all followers. So, that entry will not be lost. The new leader will ensure it is committed.

The caveats to considering an entry stored on a majority of the cluster to be committed are described in both the Raft paper and disseration.