Cassandra data redistribution when new nodes join

Question

I'm a beginner in Cassandra. I want to understand how the data gets (re)distributed when a new node joins an existing cluster.

Suppose there are 100 row keys in a cluster of 10 nodes. For simplicity, assume that a hash function distributes the rows evenly across the 10 nodes, i.e. node N1 has row keys 1 to 10, node N2 has row keys 11 to 20, and so on.

Now, if a new node N11 joins the cluster, how can the data be distributed over 11 nodes while keeping the same hash function? The output range of the hash function was previously divided among 10 nodes, so after the new node is added it seems that the range of the hash function would need to change.

Given this scenario, how would a lookup for an older record (written when only 10 nodes were present) succeed?

G Quintana · Accepted Answer · 2015-03-03T16:32:48

Prior to Cassandra 1.2, adding a node to the cluster meant splitting token ranges. For instance, with a hash function producing values between 1 and 100:

Before: 1-10,11-20,21-30,31-40,41-50,51-60,61-70,71-80,81-90,91-100
After: 1-5,6-10,11-20,21-30,31-40,41-50,51-60,61-70,71-80,81-90,91-100

The first node's token range, 1-10, is split into 1-5 and 6-10, and the first node gives one of those two parts to the new node.
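The split can be sketched in a few lines of Python. This is a toy model rather than Cassandra code: the node names, the 1 to 100 token space, and the `split_range` helper are all hypothetical, and giving the lower half to the new node is an arbitrary choice made for the illustration.

```python
# Toy model of the pre-1.2 behaviour: one contiguous token range per node.
# Node names and the 1..100 token space are illustrative only.

def split_range(ranges, owner, new_owner):
    """Split `owner`'s range in half and hand the lower half to `new_owner`."""
    lo, hi = ranges[owner]
    mid = (lo + hi) // 2
    updated = dict(ranges)            # leave the original mapping untouched
    updated[new_owner] = (lo, mid)    # the new node takes the lower half
    updated[owner] = (mid + 1, hi)    # the old owner keeps the upper half
    return updated

# Ten nodes, ten tokens each: N1 -> (1, 10), N2 -> (11, 20), ...
before = {f"N{i + 1}": (i * 10 + 1, (i + 1) * 10) for i in range(10)}
after = split_range(before, "N1", "N11")

print(after["N11"], after["N1"])  # (1, 5) (6, 10)
print(after["N2"])                # (11, 20), unchanged
```

Only the node whose range is split has to hand over data; the other nine ranges stay exactly as they were.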

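Each node maintains a map of all nodes and knows which node handles which token ranges (including replicas). When a node is added to or removed from the cluster, the other nodes are informed of the change by gossiping with each other. The hash function itself never changes: a lookup hashes the row key to the same token as before, and the updated map tells the node which peer now owns that token.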
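To make the lookup path concrete, here is a minimal sketch assuming the toy token space of 1 to 100 from the example. The names `token_for`, `owner_of`, `ring_10` and `ring_11` are hypothetical, and MD5 reduced modulo 100 only stands in for a real partitioner. The point is that the key-to-token step is identical before and after the node joins; only the token-to-node map differs.

```python
import bisect
import hashlib

TOKEN_SPACE = 100  # toy token space 1..100

def token_for(key):
    """Hash a row key to a token in 1..TOKEN_SPACE (stand-in for the partitioner)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % TOKEN_SPACE + 1

def owner_of(token, ring):
    """`ring` is a sorted list of (range_end_token, node) pairs.
    The owner is the first entry whose end token is >= `token`."""
    ends = [end for end, _ in ring]
    idx = bisect.bisect_left(ends, token)
    return ring[idx % len(ring)][1]   # modulo handles wrap-around on a real ring

# Ten nodes whose ranges end at 10, 20, ..., 100.
ring_10 = [(10 * (i + 1), f"N{i + 1}") for i in range(10)]
# N11 joins with end token 5, so it takes over tokens 1..5 from N1.
ring_11 = sorted(ring_10 + [(5, "N11")])

for token in (3, 7, 42):
    print(token, owner_of(token, ring_10), owner_of(token, ring_11))
# 3 N1 N11
# 7 N1 N1
# 42 N5 N5

key = "row-17"
print(owner_of(token_for(key), ring_11))  # same token as before, looked up in the new map
```

A record written when only 10 nodes existed is found the same way: its token has not changed, and if that token now belongs to the new node, the data for it was handed over when the node joined.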

Since Cassandra 1.2, with the addition of virtual nodes (vnodes), each node of the cluster gives a part of its token ranges to the new node. As a result, each node keeps more or less the same share of the token space, and therefore the same load.
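The effect of virtual nodes can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: tokens are picked uniformly at random, and `TOKEN_SPACE`, `VNODES`, `build_ring`, `add_node` and `ownership` are hypothetical names, not Cassandra's actual token allocation logic.

```python
import random

TOKEN_SPACE = 2 ** 16   # toy token space, far smaller than a real one
VNODES = 16             # tokens per node; hypothetical value for this sketch

def build_ring(nodes, rng):
    """Give each node VNODES distinct random tokens; return sorted (token, node) pairs."""
    tokens = rng.sample(range(TOKEN_SPACE), VNODES * len(nodes))
    return sorted((t, nodes[i // VNODES]) for i, t in enumerate(tokens))

def add_node(ring, node, rng):
    """Add `node` with VNODES random tokens that are not already in use."""
    used = {t for t, _ in ring}
    free = [t for t in range(TOKEN_SPACE) if t not in used]
    return sorted(ring + [(t, node) for t in rng.sample(free, VNODES)])

def ownership(ring):
    """Number of tokens owned per node. A vnode with token t owns the range
    (previous token, t], wrapping around at the end of the ring."""
    owned = {}
    prev = ring[-1][0] - TOKEN_SPACE   # wrap-around predecessor of the first token
    for token, node in ring:
        owned[node] = owned.get(node, 0) + (token - prev)
        prev = token
    return owned

rng = random.Random(42)
nodes = [f"N{i + 1}" for i in range(10)]
ring = build_ring(nodes, rng)
bigger = add_node(ring, "N11", rng)

before, after = ownership(ring), ownership(bigger)
donors = [n for n in nodes if after[n] < before[n]]
print(len(donors), "of", len(nodes), "existing nodes gave up part of their ranges")
print("N11 share:", round(after["N11"] / TOKEN_SPACE, 3))
```

Because the new node's tokens are scattered around the ring, each one carves a small slice out of a different existing range, so the handover is spread across many nodes instead of halving a single node's range.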
