I am trying to understand how a distributed storage system built on Raft filters duplicate requests even after the client's session has expired.

I have gone through chapter 6.3 of the Raft dissertation, which describes how LogCabin (a distributed storage system built on Raft) filters duplicate client requests by maintaining client sessions. It uses these sessions to keep track of the requests that have been applied to the leader's state machine, and it stores each result in a cache keyed by the clientID and sessionID. When the client issues the same request again in the same session, the cluster simply returns the response from the cache. If the cluster does not hear any requests from a client for an hour, it expires that client's session.

Let's say a client (Id = 1) sends the following request to the cluster in its current active session (sessionId = 123), to increment the inventory count of a product by 3:

{ ProductName = iPod, Count = INC(3) } 

The leader received the client request, replicated it to a majority of the followers, applied it to the state machine, and cached the result, so that if the client issues the same request again it can simply return the cached result because the request is a duplicate.

For some reason the client didn't receive the "success" response from the cluster.

The same client (Id = 1) then went inactive for an hour, so the leader expired the client's session and invalidated the cache entries for this client.

The client comes back after an hour of inactivity and issues the same (duplicate) request to have it processed again. In this case Raft will create a new session for the client.

Now the question is: how can the cluster still filter the duplicate request when the client joins with a new session and tries to execute the same request that it issued in its previous session?

Chapter 6.3 of the Raft dissertation describes the following as one solution:

The second issue is how to deal with a client that continues to operate after its session was expired. We expect this to be an exceptional situation; there is always some risk of it, however, since there is generally no way to know when clients have exited. One option would be to allocate a new session for a client any time there is no record of it, but this would risk duplicate execution of commands that were executed before the client’s previous session was expired. To provide stricter guarantees, servers need to distinguish a new client from a client whose session was expired. When a client first starts up, it can register itself with the cluster using the RegisterClient RPC. This allocates the new client’s session and returns the client its identifier, which the client includes with all subsequent commands. If a state machine encounters a command with no record of the session, it does not process the command and instead returns an error to the client. LogCabin currently crashes the client in this case (most clients probably wouldn’t handle session expiration errors gracefully and correctly, but systems must typically already handle clients crashing).
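One way to read the quoted passage is sketched below in Python. This is only an illustration under stated assumptions, not LogCabin's actual code: the names `InventoryStateMachine`, `register_client`, `increment`, `expire_idle_sessions` and `SessionExpired` are hypothetical, each command is assumed to carry a per-client sequence number, and the time `now` is assumed to be supplied with each command so that every replica makes the same expiry decision.

```python
from dataclasses import dataclass, field


class SessionExpired(Exception):
    """Raised when a command names a session the state machine has no record of."""


@dataclass
class Session:
    last_active: float
    # Hypothetical response cache: sequence number -> result already returned.
    responses: dict = field(default_factory=dict)


class InventoryStateMachine:
    def __init__(self, session_ttl=3600.0):
        self.session_ttl = session_ttl  # one hour, as in the question
        self.sessions = {}              # session id -> Session
        self.inventory = {}             # product name -> count
        self.next_session_id = 1

    def register_client(self, now):
        """Stand-in for the RegisterClient RPC: allocate a fresh session."""
        session_id = self.next_session_id
        self.next_session_id += 1
        self.sessions[session_id] = Session(last_active=now)
        return session_id

    def expire_idle_sessions(self, now):
        """Drop sessions (and their cached responses) that have been idle too long."""
        idle = [sid for sid, s in self.sessions.items()
                if now - s.last_active >= self.session_ttl]
        for sid in idle:
            del self.sessions[sid]

    def increment(self, session_id, seq, product, amount, now):
        session = self.sessions.get(session_id)
        if session is None:
            # No record of the session: refuse instead of silently creating
            # a new one, so an old command cannot be applied a second time.
            raise SessionExpired(f"no record of session {session_id}")
        session.last_active = now
        if seq in session.responses:
            # Duplicate within a live session: return the cached result.
            return session.responses[seq]
        self.inventory[product] = self.inventory.get(product, 0) + amount
        session.responses[seq] = self.inventory[product]
        return session.responses[seq]
```

In this sketch, a retry that still carries the old session id after expiry raises `SessionExpired` rather than incrementing the count again. The state machine does not decide whether the earlier command was applied; that decision is pushed back to the client, which LogCabin handles by crashing it.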

So, as per the above, the way to handle this is to check whether the client previously had a session and whether that session was expired due to inactivity.

I am finding it difficult to understand how this helps to filter duplicate requests that the client issues in a new session after its previous session has expired. I would also like to know how this kind of issue is handled in other distributed systems.

Thanks in advance.

It's important to clarify what a "duplicate request" is here, and what the nature of that request is. - Vishrant
Let's say the request is to increment the inventory count of a particular product, "iPod": { ProductName = iPod, Count = INC(3) }. This request gets processed by the leader and is applied to the state machine. For some reason the client didn't receive the "success" response from the cluster for this request. It then tries to execute the same request again, which is a duplicate and should not be applied to the state machine. - Yathish Manjunath

1 Answer

What you are describing is not a network-partition problem. If the client is disconnected and then makes the request again, the system should be idempotent. Because the request was already made to the cluster (which follows the consensus protocol), the integrity of the system is still maintained, and it is the client's responsibility to handle the response to the duplicate request accordingly.

In the case of a network partition, however, if the request is still in flight and the client node has disconnected, the system can use:

  • "Shoot the other node in the head" (STONITH) techniques to remove that node from the cluster. One such technique is to cut the node off from the network so that it cannot do any harm.