6
votes

I'm curious about how replication works in Cassandra. I read the DataStax documentation on data distribution:

http://www.datastax.com/docs/1.2/cluster_architecture/data_distribution

The consistent hashing section says that Cassandra computes a hash value for each primary key and sends the data to the node responsible for that hash value. It then shows how data is distributed across a cluster. My question is: how does Cassandra copy this data to other nodes in the cluster based on the hash value?

This may be a very basic question. Please explain with an example if possible.

2 Answers

12
votes

The way replicas are found depends on the replication strategy. For SimpleStrategy with replication factor N and without virtual nodes, Cassandra does the following:

  1. Hash the key
  2. Find the node with the smallest token greater than or equal to the hash, wrapping around if necessary
  3. Store the key on that node and the next N-1 nodes in token order

As an example, suppose your nodes have tokens 0, 10, 20, 30 and your replication factor is 2. If your key has hash 14 then it will be stored on the nodes with tokens 20 and 30. If your key has hash 28 then it will be stored on the nodes with tokens 30 and 0.
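The three steps can be sketched in a few lines of Python. This is only an illustration of the ring walk described above, not Cassandra's actual code; the function name `simple_strategy_replicas` and the use of small integer tokens are assumptions made for the example.

```python
from bisect import bisect_left

def simple_strategy_replicas(key_hash, tokens, replication_factor):
    """Return the tokens of the nodes that would store a key (no vnodes).

    Hypothetical helper for illustration; real tokens are much larger.
    """
    ring = sorted(tokens)
    # Index of the smallest token >= key_hash; wrap to the first node
    # when the hash is larger than every token.
    start = bisect_left(ring, key_hash) % len(ring)
    # Cannot place more replicas than there are nodes.
    count = min(replication_factor, len(ring))
    # Take that node and the next count-1 nodes in token order.
    return [ring[(start + i) % len(ring)] for i in range(count)]

print(simple_strategy_replicas(14, [0, 10, 20, 30], 2))  # [20, 30]
print(simple_strategy_replicas(28, [0, 10, 20, 30], 2))  # [30, 0]
```

Both calls reproduce the example above: hash 14 lands on the nodes with tokens 20 and 30, and hash 28 wraps around to 30 and 0.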

If you use virtual nodes, the same idea applies, but a virtual node is skipped as a replica if its physical node has already received the key.
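A rough sketch of that skipping rule, assuming the ring is given as a mapping from token to physical node name. The function `vnode_replicas` and the node names are made up for the example.

```python
from bisect import bisect_left

def vnode_replicas(key_hash, token_to_node, replication_factor):
    """Walk the ring of virtual-node tokens and collect distinct physical nodes.

    Hypothetical helper: token_to_node maps each vnode token to its owner.
    """
    ring = sorted(token_to_node)
    start = bisect_left(ring, key_hash) % len(ring)
    replicas = []
    for i in range(len(ring)):
        node = token_to_node[ring[(start + i) % len(ring)]]
        # Skip a vnode whose physical node already holds the key.
        if node not in replicas:
            replicas.append(node)
        if len(replicas) == replication_factor:
            break
    return replicas

# Three physical nodes (A, B, C), each owning two vnode tokens.
ring = {0: "A", 10: "B", 20: "B", 30: "C", 40: "A", 50: "C"}
print(vnode_replicas(5, ring, 2))  # ['B', 'C']
```

For hash 5 the walk starts at token 10 (node B), skips token 20 because it also belongs to B, and picks node C at token 30.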

With NetworkTopologyStrategy, nodes are skipped once the replica quota for their data center has been reached.
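The per-data-center quota can be sketched the same way. This simplified version only models the quota rule described here and ignores racks, which the real strategy also takes into account; the function `nts_replicas`, the node names, and the data center names are assumptions for the example.

```python
from bisect import bisect_left

def nts_replicas(key_hash, token_to_node, node_dc, rf_per_dc):
    """Collect replicas per data center while walking the ring.

    Hypothetical, simplified helper: rack awareness is not modelled.
    """
    ring = sorted(token_to_node)
    start = bisect_left(ring, key_hash) % len(ring)
    placed = {dc: [] for dc in rf_per_dc}
    for i in range(len(ring)):
        node = token_to_node[ring[(start + i) % len(ring)]]
        dc = node_dc[node]
        # Skip nodes in unlisted data centers, nodes already chosen,
        # and nodes whose data center has reached its quota.
        if dc in placed and node not in placed[dc] and len(placed[dc]) < rf_per_dc[dc]:
            placed[dc].append(node)
    return placed

tokens = {0: "a1", 10: "a2", 20: "b1", 30: "a3", 40: "b2"}
dcs = {"a1": "dc1", "a2": "dc1", "a3": "dc1", "b1": "dc2", "b2": "dc2"}
print(nts_replicas(5, tokens, dcs, {"dc1": 2, "dc2": 1}))
# {'dc1': ['a2', 'a3'], 'dc2': ['b1']}
```

Here `b2` is skipped because dc2 already has its single replica, and `a1` is skipped because dc1 already has two.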

0
votes

I read about data distribution in Cassandra with virtual nodes at http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html. In the bottom portion of the graphic, every virtual node has 3 replicas on different physical nodes. So is the replication strategy determined when the virtual node is assigned?