How are nodes decided for replication in Cassandra

Question

I am trying to understand exactly how data is replicated across multiple nodes in Cassandra. Let's assume we have 6 nodes and a replication factor of 3. For simplicity, let's assume a single datacenter and a single rack. Since RF is 3, data is stored on 3 replicas. I want to understand how those 3 replicas are chosen.

Referring to the example in http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2 (first image, second part, i.e. with virtual nodes), let's say our row falls under virtual node 'E' as decided by the partitioner. Then the row must be present on Nodes 1, 5 and 6, according to how the virtual nodes are distributed among the nodes.

But the documentation - http://docs.datastax.com/en/cassandra/2.1/cassandra/architecture/architectureDataDistributeReplication_c.html - says that even in the simple case of SimpleStrategy, the first replica is placed on the node determined by the partitioner, and additional replicas are placed on the next nodes clockwise in the ring. So will the data be stored on virtual nodes E, F, G, or maybe on Nodes 1, 2, 3?

Which one is correct: the first link or the documentation?

Thanks!

Marko Švaljek · Accepted Answer · 2017-01-16T22:38:01

If you really want to know where the data for a given partition ends up in the cluster, you can use:

nodetool getendpoints <keyspace> <table> <key>

https://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsGetEndPoints.html

Please take into account that the documentation is simplified so that it is easier to understand when you see it for the first time. In reality it's consistent hashing on steroids.

Previously, every node had a single token, and the tokens were boundaries on the ring used for consistent hashing. Basically, the whole token range was divided among the nodes in the cluster. When you needed to do an operation on some partition, you took the partition key, hashed it, and then you knew which node to go to. Hashing gives you a number in the range -2^63 to 2^63 - 1 (with the default Murmur3Partitioner). Then you go clockwise on the ring until you "find" a token, and this is how you know which node the partition belongs to initially. If the replication factor is greater than 1, you just continue going clockwise on the ring until you "find" enough nodes to satisfy the replication factor. This is how you know which nodes in the cluster have your partition.
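The clockwise walk described above can be sketched in a few lines of Python. This is only an illustration, not Cassandra's actual code: the node names, the small token values and the function name replicas_single_token are made up, and the key is passed in as an already-hashed token instead of running a real partitioner.

```python
from bisect import bisect_left

# Hypothetical 6-node cluster with one token per node.
# The token values are toy numbers, not real Murmur3 tokens.
RING = [
    (-6000, "node1"), (-3000, "node2"), (0, "node3"),
    (3000, "node4"), (6000, "node5"), (9000, "node6"),
]

def replicas_single_token(ring, key_token, rf):
    """Return the rf nodes found by walking clockwise from key_token."""
    tokens = [token for token, _ in ring]
    # The first token >= key_token owns the key; past the last token
    # the walk wraps around to the start of the ring.
    start = bisect_left(tokens, key_token) % len(ring)
    # Assumes rf <= number of nodes, so no node is visited twice.
    return [ring[(start + i) % len(ring)][1] for i in range(rf)]

print(replicas_single_token(RING, 100, 3))   # ['node4', 'node5', 'node6']
print(replicas_single_token(RING, 9500, 3))  # ['node1', 'node2', 'node3']
```

The second call shows the wrap-around: a token larger than the last one on the ring lands on the first node.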

With virtual nodes there is a property num_tokens, and every node selects that many random tokens (in the range mentioned previously) when joining the ring; these tokens are then used for consistent hashing. Basically, every existing node sees that the new node wants to own portions of the ring and streams the corresponding data to it. Also, when new writes come in, they are sent to the new node that is going to own them (until the node fully joins the ring, its responses are ignored when counting up for consistency levels).
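As a rough sketch of the random token selection just described (an assumption-laden toy, not how Cassandra is implemented): pick_tokens and build_ring are hypothetical helper names, the seed exists only to make the example repeatable, and streaming and the joining state are left out entirely.

```python
import random

# Token range mentioned above.
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1

def pick_tokens(num_tokens, rng):
    """Pick num_tokens random tokens for one node."""
    return [rng.randint(MIN_TOKEN, MAX_TOKEN) for _ in range(num_tokens)]

def build_ring(node_names, num_tokens, seed=0):
    """Return the ring as a sorted list of (token, node) pairs."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    ring = []
    for name in node_names:
        ring.extend((token, name) for token in pick_tokens(num_tokens, rng))
    return sorted(ring)

ring = build_ring(["node1", "node2", "node3"], num_tokens=4)
for token, node in ring:
    print(token, node)
```

Each physical node shows up several times in the sorted ring, once per token, which is exactly what makes the ring look "interleaved" with virtual nodes.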

This is how it was before (single token per node in cluster):

This is how the ring looks with virtual nodes:

Exactly the same rules apply with virtual nodes as with ordinary consistent hashing: you go around the ring to select the replicas. If, while going around the ring, you stumble upon a physical node you have already selected, you just skip it and continue until you have found as many distinct nodes as the replication factor requires.
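A minimal sketch of that "skip the node you have already seen" rule, assuming a single datacenter and rack as in the question. The vnode layout, the token values and the name replicas_vnodes are invented for the example and do not correspond to the picture in the blog post.

```python
from bisect import bisect_left

# Hypothetical vnode ring: (token, physical node). A physical node
# appears several times because it owns several tokens.
VNODE_RING = sorted([
    (-90, "node1"), (-70, "node2"), (-50, "node1"), (-30, "node3"),
    (-10, "node2"), (10, "node4"), (30, "node1"), (50, "node5"),
    (70, "node6"), (90, "node3"),
])

def replicas_vnodes(ring, key_token, rf):
    """Walk clockwise and collect rf distinct physical nodes."""
    tokens = [token for token, _ in ring]
    start = bisect_left(tokens, key_token) % len(ring)
    replicas = []
    for i in range(len(ring)):
        node = ring[(start + i) % len(ring)][1]
        if node not in replicas:  # skip nodes that were already picked
            replicas.append(node)
        if len(replicas) == rf:
            break
    return replicas

print(replicas_vnodes(VNODE_RING, 20, 3))   # ['node1', 'node5', 'node6']
print(replicas_vnodes(VNODE_RING, -95, 3))  # ['node1', 'node2', 'node3']
```

In the second call the walk passes node1, node2, then node1 again (skipped), and only then reaches node3. So the replicas are always distinct physical nodes, found by walking the virtual-node ring.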
