How can a node with a complete log be elected if another node becomes a candidate first?

Question

I've been watching a video on the Raft algorithm at https://youtu.be/vYp4LYbnnW8?t=3244, but I'm not clear about one scenario.

In the leader election for term 4, if node s1 broadcasts RequestVote before s3 does, then nodes s2, s4, and s5 would vote for it, while s3 would not. If node s3 then broadcasts RequestVote to the others, how can it get their votes?

The possible ways to handle the situation that I can think of are:

1. If node s1 receives the rejection from s3 and finds out that s3's log is newer than its own, it does not set itself as leader even though it has received a majority of the votes.
2. The other nodes remember which candidate they voted for; if a new vote request comes in with a bigger <lastTerm, lastIndex>, they vote for the node with the bigger <lastTerm, lastIndex>.

In both scenarios, node s3 eventually gets all the others' votes and sets itself as leader. I'm not sure if my guess is correct.

Michael Deardeuff · Accepted Answer · 2018-09-10T22:25:46

(Before I comment, be aware that there is NO possible way for entry #9 to be committed. There is no indication of which log entries are committed, but this discussion works with any of #s 1-8 as being committed.)

In short, s3 does not become the leader; s1 does, because it gets a majority of the votes. If your concern is that entry #9 will be lost, that is true, but it wasn't committed anyway.
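To make the vote count concrete, here is a minimal Python sketch of the "at least as up-to-date" comparison from §5.4.1 of the Raft paper. The log positions are hypothetical (I am not copying them from the video's figure), and the function name is my own; it only illustrates why s2, s4, and s5 can grant s1 their votes while s3 refuses.

```python
def candidate_is_up_to_date(cand_last_term, cand_last_index,
                            my_last_term, my_last_index):
    """Return True if the candidate's log is at least as up-to-date as the voter's.

    The log with the later last term wins; if the last terms are equal,
    the longer log wins.
    """
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index


# Hypothetical (lastTerm, lastIndex) for each node; s3 holds the extra entry #9.
logs = {"s1": (3, 8), "s2": (3, 8), "s3": (3, 9), "s4": (2, 6), "s5": (2, 5)}

# s1 asks every other node for its vote.
granted = [
    node for node, (term, index) in logs.items()
    if node != "s1" and candidate_is_up_to_date(*logs["s1"], term, index)
]
votes = 1 + len(granted)          # s1 also votes for itself
majority = len(logs) // 2 + 1     # 3 out of 5

print(granted, votes >= majority)
```

With these assumed logs, s3 is the only node that rejects s1, and four votes out of five is more than enough for s1 to win.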

From §5.3:

In Raft, the leader handles inconsistencies by forcing the followers’ logs to duplicate its own. This means that conflicting entries in follower logs will be overwritten with entries from the leader’s log.
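As a rough illustration of that overwrite, here is a simplified Python sketch of how a follower could apply an AppendEntries request. It is not a full implementation (no term checks, no commit index), the function and variable names are mine, and the log contents are hypothetical.

```python
def append_entries(log, prev_index, prev_term, entries):
    """Apply a leader's entries to a follower's log, in place.

    log        -- list of (term, command); log index 1 is log[0]
    prev_index -- index of the entry immediately before the new ones
    prev_term  -- term of that entry in the leader's log
    entries    -- list of (term, command) to store after prev_index
    Returns False if the consistency check fails.
    """
    if prev_index > 0:
        if prev_index > len(log) or log[prev_index - 1][0] != prev_term:
            return False
    for offset, entry in enumerate(entries):
        pos = prev_index + offset          # 0-based slot for this entry
        if pos < len(log):
            if log[pos][0] != entry[0]:    # same index, different term
                del log[pos:]              # drop it and everything after it
                log.append(entry)
        else:
            log.append(entry)
    return True


# Hypothetical s3 log: nine entries from term 3; entry #9 is uncommitted.
s3_log = [(3, f"cmd{i}") for i in range(1, 10)]

# s1 won term 4 with entries 1-8 and now replicates its own entry #9.
ok = append_entries(s3_log, prev_index=8, prev_term=3, entries=[(4, "new9")])
print(ok, len(s3_log), s3_log[-1])
```

After the call, s3 still has nine entries, but its old entry #9 from term 3 has been replaced by the leader's entry from term 4.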

To comment on your handling of the situation:

1. If node s1 receives the rejection from s3 and finds out that s3's log is newer than its own, it does not set itself as leader even though it has received a majority of the votes.

It could do this, but it would make failover take longer, because s3 would have to try again with a different timeout, and you could run into a race condition where s1 always broadcasts RequestVote before s3 does. But again, it is always safe to delete the excess entries that s3 has.

The last paragraph of §5.2 talks about how this easy, timeout-based election process was chosen instead of ranking the nodes and selecting the best. I agree with the outcome. Simpler protocols are more robust.

2. The other nodes remember which candidate they voted for; if a new vote request comes in with a bigger <lastTerm, lastIndex>, they vote for the node with the bigger <lastTerm, lastIndex>.

This is strictly forbidden because it destroys leader election. Each node may vote for at most one candidate in a given term; if nodes could switch, s1 could win term 4 with the votes of s2, s4, and s5, and s3 could then win the same term with those same votes. That is, if you have this in place you will very often elect multiple leaders. This is bad. I cannot stress enough how bad this is. Bad, bad, bad.
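A small Python simulation of the two voting rules shows the difference. Everything here is a sketch under assumptions: the `Voter` class, the `allow_switch` flag, and the (lastTerm, lastIndex) values are hypothetical, and s1 and s3 are assumed to vote only for themselves.

```python
from dataclasses import dataclass, field


@dataclass
class Voter:
    last: tuple                  # (lastTerm, lastIndex) of this node's log
    allow_switch: bool           # True = the proposed (unsafe) rule
    voted_for: dict = field(default_factory=dict)  # term -> (candidate, its log)

    def request_vote(self, term, candidate, cand_last):
        if cand_last < self.last:            # candidate's log is behind ours
            return False
        previous = self.voted_for.get(term)
        if previous is None or previous[0] == candidate:
            self.voted_for[term] = (candidate, cand_last)
            return True
        if self.allow_switch and cand_last > previous[1]:
            self.voted_for[term] = (candidate, cand_last)   # changes its vote
            return True
        return False                         # Raft: one vote per term


def leaders_in_term_4(allow_switch):
    logs = {"s1": (3, 8), "s2": (3, 8), "s3": (3, 9), "s4": (2, 6), "s5": (2, 5)}
    followers = [Voter(logs[n], allow_switch) for n in ("s2", "s4", "s5")]
    leaders = []
    for candidate in ("s1", "s3"):           # s1 broadcasts first, then s3
        votes = 1 + sum(v.request_vote(4, candidate, logs[candidate])
                        for v in followers)
        if votes >= 3:                       # majority of 5
            leaders.append(candidate)
    return leaders


print(leaders_in_term_4(allow_switch=False))
print(leaders_in_term_4(allow_switch=True))
```

With the one-vote-per-term rule only s1 wins term 4. With vote switching, s1 has already counted its majority by the time s2, s4, and s5 move to s3, so both nodes believe they are the leader for the same term.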
