While implementing the Raft algorithm, I ran into a situation that may harm the cluster, and I am not sure whether it actually does.
It is reasonable to assume that some AppendEntries RPCs from the Leader arrive out of order (because of network delay or for other reasons). Suppose the Leader sends a heartbeat AppendEntries RPC to peer A with prev_log_index = 1, then sends another AppendEntries RPC carrying entry 2, and then crashes (I make this happen immediately with a callback in my test). If the two RPCs are handled in the order in which they were sent, entry 2 is inserted successfully. However, if the heartbeat RPC is delayed, peer A first inserts entry 2 and responds to the Leader. Then the delayed heartbeat arrives and peer A erases entry 2, because the entry conflicts with the Leader's prev_log_index = 1. So peer A erases a log entry by mistake.
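To make the scenario concrete, here is a minimal sketch of the follower behaviour I am describing. It is not my real code: `Entry`, `AppendEntries`, and `NaiveFollower` are hypothetical names, and the sketch assumes the follower drops everything after prev_log_index on every AppendEntries RPC, heartbeats included.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    index: int
    term: int


@dataclass
class AppendEntries:
    term: int
    prev_log_index: int
    prev_log_term: int
    entries: list = field(default_factory=list)  # empty list means heartbeat


class NaiveFollower:
    """Hypothetical follower that truncates after prev_log_index on every RPC."""

    def __init__(self, log):
        self.log = list(log)  # self.log[i] holds the entry with index i + 1

    def handle(self, rpc):
        # Consistency check: we must hold an entry at prev_log_index
        # whose term matches prev_log_term.
        if rpc.prev_log_index > 0:
            if len(self.log) < rpc.prev_log_index:
                return False
            if self.log[rpc.prev_log_index - 1].term != rpc.prev_log_term:
                return False
        # Assumed behaviour under discussion: unconditionally discard
        # everything after prev_log_index, then append the new entries.
        del self.log[rpc.prev_log_index:]
        self.log.extend(rpc.entries)
        return True


heartbeat = AppendEntries(term=1, prev_log_index=1, prev_log_term=1)
append_2 = AppendEntries(term=1, prev_log_index=1, prev_log_term=1,
                         entries=[Entry(index=2, term=1)])

# Case 1: RPCs handled in the order they were sent.
in_order = NaiveFollower([Entry(index=1, term=1)])
in_order.handle(heartbeat)
in_order.handle(append_2)

# Case 2: the heartbeat is delayed and arrives after entry 2.
reordered = NaiveFollower([Entry(index=1, term=1)])
reordered.handle(append_2)
reordered.handle(heartbeat)

print(len(in_order.log), len(reordered.log))
```

In this sketch the in-order follower ends up with two entries, while the reordered one is back to a single entry even though it has already acknowledged entry 2.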
To dig a little deeper: if the Leader doesn't crash immediately, will this be fixed? I think that if peer A responds to the delayed heartbeat correctly, the Leader will notice and repair the log in a later RPC.
However, what if peer A's response to the RPC carrying entry 2 causes commit_index to advance? In that case peer A has voted to advance commit_index to 2 even though it no longer has entry 2, so there may not actually be enough votes for the advance. If the Leader crashes now, a node with a shorter log can be elected Leader. I have actually encountered this situation in my testing.
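The commit problem can be sketched the same way. This is a simplified illustration, not my implementation: `majority_match`, `match_index`, and `actual_last_index` are hypothetical names, the cluster is assumed to have three nodes, and the sketch ignores Raft's additional requirement that the entry being committed belongs to the Leader's current term.

```python
def majority_match(match_index):
    """Highest log index that a majority of nodes is believed to hold."""
    ranked = sorted(match_index.values(), reverse=True)
    return ranked[len(ranked) // 2]


# What the Leader believes after peer A acknowledged entry 2.
match_index = {"leader": 2, "A": 2, "B": 1}

# What each node really holds after A processed the delayed heartbeat
# and truncated its log back to entry 1.
actual_last_index = {"leader": 2, "A": 1, "B": 1}

commit_index = majority_match(match_index)
holders = [node for node, last in actual_last_index.items()
           if last >= commit_index]

print(commit_index, holders)
```

Here the Leader advances commit_index to 2 although it is the only node that still holds entry 2, so once it crashes either remaining node can win the election without that entry.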
My questions are:
- Is my reasoning correct?
- If reordered RPCs are a real problem, how should I solve it? Is indexing and caching all RPCs, and forcing them to be handled one at a time, a good solution? I found that hard to implement in gRPC.