8
votes

Backgound:

In section 3, named Implementing a State Machine, of Lamport's paper Paxos Made Simple, Multi-Paxos is described. Multi-Paxos is used in Google Paxos Made Live. (Multi-Paxos is used in Apache ZooKeeper). In Multi-Paxos, gaps can appear:

In general, suppose a leader can get α commands ahead--that is, it can propose commands i + 1 through i + α commands after commands 1 through i are chosen. A gap of up to α - 1 commands could then arise.

Now consider the following scenario:

The whole system uses master-slave architecture. Only the master serves client commands. Master and slaves reach consensus on the sequence of commands via Multi-Paxos. The master is the leader in Multi-Paxos instances. Assume now the master and two of its slaves have the states (commands have been chosen) shown in the following figure:

Master and Slaves.

Note that, there are more than one gaps in the master state. Due to asynchrony, the two slaves lag behind. At this time, the master fails.

Problem:

  1. What should the slaves do after they have detected the failure of the master (for example, by heartbeat mechanism)?

  2. In particular, how to handle with the gaps and the missing commands with respect to that of the old master?

Update about Zab:

As @sbridges has pointed out, ZooKeeper uses Zab instead of Paxos. To quote,

Zab is primarily designed for primary-backup (i.e., master-slave) systems, like ZooKeeper, rather than for state machine replication.

It seems that Zab is closely related to my problems listed above. According to the short overview paper of Zab, Zab protocol consists of two modes: recovery and broadcast. In recovery mode, two specific guarantees are made: never forgetting committed messages and letting go of messages that are skipped. My confusion about Zab is:

  1. In recovery mode does Zab also suffer from the gaps problem? If so, what does Zab do?
4

4 Answers

2
votes

The gap should be the Paxos instances that has not reached agreement. In the paper Paxos Made Simple, the gap is filled by proposing a special “no-op” command that leaves the state unchanged.

If you cares about the order of chosen values for Paxos instances, you'd better use Zab instead, because Paxos does not preserve causal order. https://cwiki.apache.org/confluence/display/ZOOKEEPER/PaxosRun

The missing command should be the Paxos instances that has reached agreement, but not learned by learner. The value is immutable because it has been accepted a quorum of acceptor. When you run a paxos instance of this instance id, the value will be chosen and recovered to the same one on phase 1b.

When slaves/followers detected a failure on Leader, or the Leader lost a quorum support of slaves/follower, they should elect a new leader.

In zookeeper, there should be no gaps as the follower communicates with leader by TCP which keeps FIFO.

In recovery mode, after the leader is elected, the follower synchronize with leader first, and apply the modification on state until NEWLEADER is received.

In broadcast mode, the follower queues the PROPOSAL in pendingTxns, and wait the COMMIT in the same order. If the zxid of COMMIT mismatch with the zxid of head of pendingTxns, the follower will exit.

https://cwiki.apache.org/confluence/display/ZOOKEEPER/Zab1.0

1
votes

Multi-Paxos is used in Apache ZooKeeper

Zookeeper uses zab, not paxos. See this link for the difference.

In particular, each zookeeper node in an ensemble commits updates in the same order as every other nodes,

Unlike client requests, state updates must be applied in the exact original generation order of the primary, starting from the original initial state of the primary. If a primary fails, a new primary that executes recovery cannot arbitrarily reorder uncommitted state updates, or apply them starting from a different initial state.

1
votes

Specifically the ZAB paper says that a newly elected leader undertakes discovery to learn the next epoch number to set and who has the most up-to-date commit history. The follower sands an ACK-E message which states the max contiguous zxid it has seen. It then says that it undertakes a synchronisation phase where it transmits the state which followers which they have missed. It notes that in interesting optimisation is to only elect a leader which has a most up to date commit history.

With Paxos you don't have to allow gaps. If you do allow gaps then the paper Paxos Made Simple explains how to resolve them from page 9. A new leader knows the last committed value it saw and possibly some committed values above. It probes the slots from the lowest gap it knows about by running phase 1 to propose to those slots. If there are values in those slots it runs phase 2 to fix those values but if it is free to set a value it sets no-op value. Eventually it gets to the slot number where there have been no values proposed and it runs as normal.

In answer to your questions:

  1. What should the slaves do after they have detected the failure of the master (for example, by heartbeat mechanism)?

They should attempt to lead after a randomised delay to try to reduce the risk of two candidates proposing at the same time which would waste messages and disk flushes as only one can lead. Randomised leader timeout is well covered in the Raft paper; the same approach can be used for Paxos.

  1. In particular, how to handle with the gaps and the missing commands with respect to that of the old master?

The new leader should probe and fix the gaps to either the highest value proposed to that slot else a no-op until it has filled in the gaps then it can lead as normal.

0
votes

The answer of @Hailin explains the gap problem as follows:

In zookeeper, there should be no gaps as the follower communicates with leader by TCP which keeps FIFO"

To supplement:

In the paper A simple totally ordered broadcast protocol, it mentions that ZooKeeper requires the prefix property:

If $m$ is the last message delivered for a leader $L$, any message proposed before $m$ by $L$ must also be delivered".

This property mainly relies on the TCP mechanism used in Zab. In Zab Wiki, it mentions that the implementation of Zab must follow the following assumption (besides others):

Servers must process packets in the order that they are received. Since TCP maintains ordering when sending packets, this means that packets will be processed in the order defined by the sender.