
I recently started exploring Cassandra for a new project. Here is my understanding of Cassandra so far (based on the numerous blogs available online):

  1. Cassandra (C*) is AP in CAP-theorem terms, i.e. it favors availability and partition tolerance
  2. C* has a logical ring topology. Not in the real networking sense (all nodes can talk to each other directly), but in the data-replication sense (data is replicated to neighboring nodes on the ring)
  3. C* uses the Murmur3 partitioner to map partition keys onto a 64-bit integer range (-2^63 to 2^63 - 1)
  4. Every node is responsible for a few ranges (vnodes enabled), e.g. node1: [-2^63 to -2^61], [2^31 to 2^33] .... node2: [-2^61 + 1 to -2^59], [2^33 + 1 to 2^35].... (see the toy sketch after this list)
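
To check that I'm picturing this correctly, here is a toy sketch of my mental model. This is definitely not Cassandra's implementation: I'm using MD5 as a stand-in for Murmur3, the node tokens are made up, and the replica walk is only a crude SimpleStrategy-style approximation.

    # Toy mental model only -- NOT Cassandra's code. MD5 stands in for Murmur3,
    # and the ring below is a made-up set of (token, node) vnode boundaries.
    import bisect
    import hashlib

    def toy_token(partition_key: str) -> int:
        """Map a partition key onto the signed 64-bit token range."""
        digest = hashlib.md5(partition_key.encode()).digest()
        return int.from_bytes(digest[:8], "big", signed=True)

    ring = sorted([
        (-6_000_000_000_000_000_000, "node1"),
        (-2_000_000_000_000_000_000, "node2"),
        (1_000_000_000_000_000_000, "node3"),
        (5_000_000_000_000_000_000, "node1"),
    ])

    def replicas(partition_key: str, rf: int = 2) -> list[str]:
        """Owner = first vnode whose token >= the key's token (wrapping around),
        then keep walking clockwise until rf distinct nodes are collected."""
        t = toy_token(partition_key)
        tokens = [tok for tok, _ in ring]
        i = bisect.bisect_left(tokens, t) % len(ring)
        owners: list[str] = []
        while len(owners) < rf:
            if ring[i][1] not in owners:
                owners.append(ring[i][1])
            i = (i + 1) % len(ring)
        return owners

    print(toy_token("user:42"), replicas("user:42", rf=2))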

Now to my questions. Suppose I create a cluster with 3 nodes and RF = 2, then

  1. In this case, will the whole token range (I believe I am using the right terminology here), -2^63 to 2^63 - 1, be distributed evenly across these 3 nodes?
  2. What happens if I add another node to this running cluster? My assumption is that C* will rebalance the ring because it uses consistent hashing, so the range -2^63 to 2^63 - 1 will be re-distributed and the corresponding data will be copied to the new node. (The copied data on the existing nodes won't be deleted until a repair happens?)
  3. What happens when a node goes down? Does C* rebalance the ring by re-distributing the tokens and moving data around?
  4. Is "token" just a fancy word for Murmur3hash(partition key)? And what exactly is meant by a partition range? Is it something like partition range1 = -2^63 to -2^61, partition range2 = -2^61 + 1 to -2^59, and so on?
  5. What information is actually shared during gossip?

Sorry if these questions seem very basic, but I have spent quite some time on this and could not find definitive answers.

I found a few helpful Q&As on SO, but they do not answer my questions definitively. For example, this one [0] talks about rebalancing, but I still do not understand whether C* re-distributes the ranges. [0]: stackoverflow.com/questions/48735530/… – Abhijeet Shukla
Please only ask one question per question. – Aaron

2 Answers


I will try to explain this in a simple way.

Cassandra keeps its configuration simple: everything is done in cassandra.yaml. You can also go through THIS to get a picture of how partitioning works in a cluster.

Let's start with the basics. Instead of using three nodes, let's use only one node for now. With Cassandra's default configuration we get the values below in the cassandra.yaml file:

num_tokens: 1
initial_token: 0

This means there is only one node and all the partitions will reside on it. The concept of a virtual node is, in simple terms, that Cassandra divides the token space into multiple ranges owned by the same machine, as if there were more nodes, even though no extra physical nodes exist. So how do we enable the virtual nodes feature in cassandra.yaml? The answer is the num_tokens value.

num_tokens: 128
#initial_token: 0

This configuration gives the node 128 token ranges, for example 0-10, 11-20, 21-30, and so on. Keep the initial_token value commented out; this means we want Cassandra to decide the initial tokens (one less thing to worry about).
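
Purely as an illustration (this is not how Cassandra actually chooses its tokens), here is a rough Python sketch of what num_tokens: 128 means for a single node: the node ends up with 128 tokens, and each pair of consecutive tokens delimits one range.

    # Rough illustration only -- not Cassandra's token allocation.
    import random

    TOKEN_MIN, TOKEN_MAX = -2**63, 2**63 - 1

    random.seed(1)
    tokens = sorted(random.randint(TOKEN_MIN, TOKEN_MAX) for _ in range(128))

    # Each range runs from the previous token (exclusive) to this token (inclusive);
    # the very first one wraps around the ring.
    ranges = [(tokens[i - 1], tokens[i]) for i in range(len(tokens))]
    for start, end in ranges[:3]:
        print(f"({start}, {end}]")
    print(f"... {len(ranges)} ranges in total")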

Now let's add another node to the cluster. Below is a simple configuration for the new node. For simplicity, consider the first node's IP to be 127.0.0.1 and the second node's IP to be 127.0.0.2.

num_tokens: 128
#initial_token: 0
seed_provider:
- class_name: org.apache.cassandra.locator.SimpleSeedProvider
  parameters:
      - seeds: "127.0.0.1, 127.0.0.2"

We have just added a new node to our cluster; node1 will serve as the seed node. The num_tokens value is 128, which means 128 token ranges. The value of initial_token is commented out, which means Cassandra will decide the initial tokens and ranges. Data transfer will start as soon as the new node joins the cluster.

For the third node, the configuration would be as below:

num_tokens: 128
#initial_token: 0
seed_provider:
- class_name: org.apache.cassandra.locator.SimpleSeedProvider
  parameters:
      - seeds: "127.0.0.1, 127.0.0.2, 127.0.0.3"

So the third node will take over a few token ranges from node1 and a few token ranges from node2.
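
As a rough back-of-the-envelope simulation (not Cassandra's actual allocation code; node names and token counts are made up), you can see both why ownership ends up roughly even and why adding a node only carves slices out of existing ranges:

    # Rough simulation of vnode ownership -- not Cassandra's allocation code.
    import random

    TOKEN_MIN, TOKEN_MAX = -2**63, 2**63 - 1
    RING_SIZE = 2**64

    def ownership(nodes: dict[str, list[int]]) -> dict[str, float]:
        """Fraction of the token ring each node owns (single DC, RF ignored)."""
        ring = sorted((tok, name) for name, toks in nodes.items() for tok in toks)
        shares = {name: 0 for name in nodes}
        for i, (tok, name) in enumerate(ring):
            prev = ring[i - 1][0]            # wraps to the last token when i == 0
            shares[name] += (tok - prev) % RING_SIZE
        return {name: share / RING_SIZE for name, share in shares.items()}

    random.seed(7)
    nodes = {f"node{i}": [random.randint(TOKEN_MIN, TOKEN_MAX) for _ in range(128)]
             for i in (1, 2, 3)}
    print(ownership(nodes))   # roughly 1/3 each

    # Add a 4th node: the existing tokens do not move; the new node's tokens
    # simply split some of the old ranges.
    nodes["node4"] = [random.randint(TOKEN_MIN, TOKEN_MAX) for _ in range(128)]
    print(ownership(nodes))   # roughly 1/4 each

This is only the ownership math; in a real cluster the data for the reassigned ranges still has to be streamed to the new node.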

I hope we have answered question 1 and question 2 by now. Let's move on to the next two questions.

When a node goes down, hinted handoff helps Cassandra maintain consistency. One of the remaining 2 nodes keeps hints for the data that was supposed to be written to the node that is down. Once that node comes back up, these hints are replayed and the data is written to the target node. There is no need for repartitioning, rebalancing or anything fancy like that. Hints are stored in a directory that can be configured in the cassandra.yaml file. By default, 3 hours' worth of hints are stored, which means a failed node should come back up within 3 hours. This value is also configurable in cassandra.yaml.

hinted_handoff_enabled: true
max_hint_window_in_ms: 10800000 # 3 hours
hints_directory: /home/ubuntu/node1/data/hints
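
To make the mechanism concrete, here is a toy sketch of the hinted-handoff idea only. This is not Cassandra's implementation; the node names and mutations are made up, and real hints live on disk in the hints_directory shown above.

    # Toy sketch of the hinted-handoff idea -- not Cassandra's implementation.
    import time

    MAX_HINT_WINDOW_MS = 10_800_000          # mirrors max_hint_window_in_ms (3 hours)

    hints: dict[str, list[tuple[float, str]]] = {}   # target node -> [(created_ms, mutation)]
    down_nodes: set[str] = set()

    def write(target_node: str, mutation: str) -> None:
        """Coordinator side: apply the write, or store a hint if the replica is down."""
        if target_node in down_nodes:
            hints.setdefault(target_node, []).append((time.time() * 1000, mutation))
        else:
            print(f"applied on {target_node}: {mutation}")

    def node_came_back(target_node: str) -> None:
        """Replay hints that are still inside the hint window; older ones are dropped
        (that data would then need a repair)."""
        down_nodes.discard(target_node)
        now_ms = time.time() * 1000
        for created_ms, mutation in hints.pop(target_node, []):
            if now_ms - created_ms <= MAX_HINT_WINDOW_MS:
                print(f"replaying on {target_node}: {mutation}")

    down_nodes.add("node2")
    write("node2", "INSERT user:42")
    node_came_back("node2")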

Murmur3Partitioner calculates the hash using the partition key columns; let's make our peace with that. There are other partitioners as well, like RandomPartitioner and ByteOrderedPartitioner.

Below is a sample output of gossip info. You can go through each field in the protocol data below:

    ubuntu@ds201-node1:~$ ./node1/bin/nodetool gossipinfo
    /127.0.0.1
      generation:1621506507                 -- the time at which this node was bootstrapped
      heartbeat:2323
      STATUS:28:NORMAL,-1316314773810616606 -- status of the node: NORMAL, LEFT, LEAVING, REMOVED, REMOVING, ...
      LOAD:2295:110299.0                    -- disk space usage
      SCHEMA:64:c5c6bdbd-5916-347a-ab5b-21813ab9d135  -- changes whenever the schema changes
      DC:37:Cassandra                       -- data center of the node
      RACK:18:rack1                         -- rack within the data center
      RELEASE_VERSION:4:4.0.0.2284
      NATIVE_TRANSPORT_ADDRESS:3:127.0.0.1
      X_11_PADDING:2307:{"dse_version":"6.0.0","workloads":"Cassandra","workload":"Cassandra","active":"true","server_id":"08-00-27-32-1E-DD","graph":false,"health":0.3}
      NET_VERSION:1:256
      HOST_ID:2:ebd0627a-2491-40d8-ba37-e82a31a95688
      NATIVE_TRANSPORT_READY:66:true
      NATIVE_TRANSPORT_PORT:6:9041
      NATIVE_TRANSPORT_PORT_SSL:7:9041
      STORAGE_PORT:8:7000
      STORAGE_PORT_SSL:9:7001
      JMX_PORT:10:7199
      TOKENS:27:<hidden>

Gossip is the broadcast protocol that spreads data across the cluster. There is no master in a Cassandra cluster; peers spread state among themselves, which helps them maintain the latest information. Nodes communicate with each other randomly using the gossip protocol (there are some criteria behind this randomness). Gossip spreads node metadata only, not client data.
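
If it helps, here is a toy sketch of the general idea of versioned gossip state. It is not Cassandra's actual protocol; the field names simply mirror the gossipinfo output above.

    # Toy sketch of gossip-style metadata exchange -- not Cassandra's protocol.
    # Each node keeps versioned state about every node it knows of; when two
    # peers gossip, the entry with the higher version wins.

    GossipState = dict[str, dict[str, tuple[int, str]]]   # node -> {field: (version, value)}

    def merge(mine: GossipState, theirs: GossipState) -> None:
        """Pull any newer (higher-version) entries from the peer's view into ours."""
        for node, fields in theirs.items():
            my_fields = mine.setdefault(node, {})
            for field, (version, value) in fields.items():
                if field not in my_fields or my_fields[field][0] < version:
                    my_fields[field] = (version, value)

    node1_view: GossipState = {"node3": {"STATUS": (28, "NORMAL"), "LOAD": (2295, "110299.0")}}
    node2_view: GossipState = {"node3": {"STATUS": (30, "LEAVING"), "LOAD": (2200, "109000.0")}}

    merge(node1_view, node2_view)
    print(node1_view)   # node1 keeps its newer LOAD but picks up the newer STATUS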

Hope this clears some doubts.


Providing additional depth:

Token range assignment does have a random component to it, so it's not exactly "even." However, if you're using allocate_tokens_for_keyspace, the new token allocation algorithm takes over, and optimizes future range assignments based on the replication factor (RF) of the specified keyspace.

Here is a six-line section of abbreviated output from nodetool ring, from a 3-node cluster built with num_tokens=16. Note that the "range" is effectively the defined starting token (hash) all the way until the next starting token minus one:

10.0.0.1  -6595849996054463311
10.0.0.2  -5923426258674511018
10.0.0.2  -5194860430157391004
10.0.0.2  -4076256821118426122
10.0.0.2  -3750110785943336998
10.0.0.3  -3045824679140675270

Watch what happens when I add a 4th node:

10.0.0.1  -6595849996054463311
10.0.0.2  -5923426258674511018
10.0.0.4  -5711305561631524277
10.0.0.2  -5194860430157391004
10.0.0.4  -4831174780910733952
10.0.0.2  -4076256821118426122
10.0.0.2  -3750110785943336998
10.0.0.4  -3659290179273062522
10.0.0.3  -3045824679140675270

Notice that the starting token ranges for each of the three original nodes remain the same. But in this particular case, ranges on 10.0.0.2 were bisected with the latter half assigned to 10.0.0.4.

Note that once streaming completes, the data in those ranges now owned by 10.0.0.4 is still present on 10.0.0.2. That's by design, in case the bootstrap process for the new node fails. Once things are stable, you can get rid of that data by running nodetool cleanup on the original three nodes.

What happens when a node goes down? Does C* rebalance the ring by re-distributing the tokens and moving data around?

This happens when nodetool removenode is run. But for a node that simply goes down, "hints" are stored on the remaining nodes to be replayed once the down node comes back.