
I recently started exploring Cassandra for a new project. Here is my understanding of Cassandra so far (based on the numerous blogs available online):

  1. Cassandra (C*) is AP in CAP-theorem terms, i.e. it favors availability and partition tolerance
  2. C* has a logical ring topology. Not in the real networking sense (all nodes can talk to each other directly), but in the data-replication sense (data is replicated to neighboring nodes on the ring)
  3. C* uses the Murmur3 partitioner to map partition keys onto a 64-bit integer range (-2^63 to 2^63 - 1)
  4. Every node is responsible for a few ranges (vnodes enabled), e.g. node1: [-2^63 to -2^61], [2^31 to 2^33] .... node2: [-2^61 + 1 to -2^59], [2^33 + 1 to 2^35].... (see the toy sketch after this list)
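
To check that I'm picturing this correctly, here is a toy sketch of my mental model. This is definitely not Cassandra's implementation: I'm using MD5 as a stand-in for Murmur3, the node tokens are made up, and the replica walk is only a crude SimpleStrategy-style approximation.

    # Toy mental model only -- NOT Cassandra's code. MD5 stands in for Murmur3,
    # and the ring below is a made-up set of (token, node) vnode boundaries.
    import bisect
    import hashlib

    def toy_token(partition_key: str) -> int:
        """Map a partition key onto the signed 64-bit token range."""
        digest = hashlib.md5(partition_key.encode()).digest()
        return int.from_bytes(digest[:8], "big", signed=True)

    ring = sorted([
        (-6_000_000_000_000_000_000, "node1"),
        (-2_000_000_000_000_000_000, "node2"),
        (1_000_000_000_000_000_000, "node3"),
        (5_000_000_000_000_000_000, "node1"),
    ])

    def replicas(partition_key: str, rf: int = 2) -> list[str]:
        """Owner = first vnode whose token >= the key's token (wrapping around),
        then keep walking clockwise until rf distinct nodes are collected."""
        t = toy_token(partition_key)
        tokens = [tok for tok, _ in ring]
        i = bisect.bisect_left(tokens, t) % len(ring)
        owners: list[str] = []
        while len(owners) < rf:
            if ring[i][1] not in owners:
                owners.append(ring[i][1])
            i = (i + 1) % len(ring)
        return owners

    print(toy_token("user:42"), replicas("user:42", rf=2))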

Now to my questions. Suppose I create a cluster with 3 nodes and RF = 2, then

  1. In this case, will the whole token range (I believe I am using the right terminology here), -2^63 to 2^63 - 1, be distributed evenly across these 3 nodes?
  2. What happens if I add another node to this running cluster? My assumption is that C* will rebalance the ring because it uses consistent hashing, so the range -2^63 to 2^63 - 1 will be re-distributed and the corresponding data will be copied to the new node. (The copied data on the existing nodes won't be deleted until a repair happens?)
  3. What happens when a node goes down? Does C* rebalance the ring by re-distributing the tokens and moving data around?
  4. Is "token" just a fancy word for Murmur3hash(partition key)? And what exactly is meant by a partition range? Is it something like partition range1 = -2^63 to -2^61, partition range2 = -2^61 + 1 to -2^59, and so on?
  5. What information is actually shared during gossip?

Sorry if these questions seem very basic, but I have spent quite some time on this and could not find definitive answers.

I found a few helpful Q&As on SO, but they do not answer my questions definitively. For example, this one [0] talks about rebalancing, but I still do not understand whether C* re-distributes the ranges. [0]: stackoverflow.com/questions/48735530/… – Abhijeet Shukla
Please only ask one question per question. – Aaron

2 Answers


I will try to explain this in a simple way.

Cassandra keeps its configuration simple: everything is done in cassandra.yaml. You can also go through THIS to get a picture of how partitioning works in a cluster.

Let's start with the basics. Instead of using three nodes, let's use only one node for now. With Cassandra's default configuration we get the values below in the cassandra.yaml file:

num_tokens: 1
initial_token: 0

This means there is only one node and all the partitions will reside on it. The concept of a virtual node is, in simple terms, that Cassandra divides the token space into multiple ranges owned by the same machine, as if there were more nodes, even though no extra physical nodes exist. So how do we enable the virtual nodes feature in cassandra.yaml? The answer is the num_tokens value.

num_tokens: 128
#initial_token: 0

This configuration gives the node 128 token ranges, for example 0-10, 11-20, 21-30, and so on. Keep the initial_token value commented out; this means we want Cassandra to decide the initial tokens (one less thing to worry about).
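
Purely as an illustration (this is not how Cassandra actually chooses its tokens), here is a rough Python sketch of what num_tokens: 128 means for a single node: the node ends up with 128 tokens, and each pair of consecutive tokens delimits one range.

    # Rough illustration only -- not Cassandra's token allocation.
    import random

    TOKEN_MIN, TOKEN_MAX = -2**63, 2**63 - 1

    random.seed(1)
    tokens = sorted(random.randint(TOKEN_MIN, TOKEN_MAX) for _ in range(128))

    # Each range runs from the previous token (exclusive) to this token (inclusive);
    # the very first one wraps around the ring.
    ranges = [(tokens[i - 1], tokens[i]) for i in range(len(tokens))]
    for start, end in ranges[:3]:
        print(f"({start}, {end}]")
    print(f"... {len(ranges)} ranges in total")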

Now let's add another node to the cluster. Below is a simple configuration for the new node. For simplicity, consider the first node's IP to be 127.0.0.1 and the second node's IP to be 127.0.0.2.

num_tokens: 128
#initial_token: 0
seed_provider:
- class_name: org.apache.cassandra.locator.SimpleSeedProvider
  parameters:
      - seeds: "127.0.0.1, 127.0.0.2"

We have just added a new node to our cluster; node1 will serve as the seed node. The num_tokens value is 128, which means 128 token ranges. The value of initial_token is commented out, which means Cassandra will decide the initial tokens and ranges. Data transfer will start as soon as the new node joins the cluster.

For the third node, the configuration would be as below:

num_tokens: 128
#initial_token: 0
seed_provider:
- class_name: org.apache.cassandra.locator.SimpleSeedProvider
  parameters:
      - seeds: "127.0.0.1, 127.0.0.2, 127.0.0.3"

So the third node will take over a few token ranges from node1 and a few token ranges from node2.
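
As a rough back-of-the-envelope simulation (not Cassandra's actual allocation code; node names and token counts are made up), you can see both why ownership ends up roughly even and why adding a node only carves slices out of existing ranges:

    # Rough simulation of vnode ownership -- not Cassandra's allocation code.
    import random

    TOKEN_MIN, TOKEN_MAX = -2**63, 2**63 - 1
    RING_SIZE = 2**64

    def ownership(nodes: dict[str, list[int]]) -> dict[str, float]:
        """Fraction of the token ring each node owns (single DC, RF ignored)."""
        ring = sorted((tok, name) for name, toks in nodes.items() for tok in toks)
        shares = {name: 0 for name in nodes}
        for i, (tok, name) in enumerate(ring):
            prev = ring[i - 1][0]            # wraps to the last token when i == 0
            shares[name] += (tok - prev) % RING_SIZE
        return {name: share / RING_SIZE for name, share in shares.items()}

    random.seed(7)
    nodes = {f"node{i}": [random.randint(TOKEN_MIN, TOKEN_MAX) for _ in range(128)]
             for i in (1, 2, 3)}
    print(ownership(nodes))   # roughly 1/3 each

    # Add a 4th node: the existing tokens do not move; the new node's tokens
    # simply split some of the old ranges.
    nodes["node4"] = [random.randint(TOKEN_MIN, TOKEN_MAX) for _ in range(128)]
    print(ownership(nodes))   # roughly 1/4 each

This is only the ownership math; in a real cluster the data for the reassigned ranges still has to be streamed to the new node.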

I hope we have answered question 1 and question 2 by now. Let's move on to the next two questions.

When a node goes down, hinted handoff helps Cassandra maintain consistency. One of the remaining 2 nodes keeps hints for the data that was supposed to be written to the node that is down. Once that node comes back up, these hints are replayed and the data is written to the target node. There is no need for repartitioning, rebalancing or anything fancy like that. Hints are stored in a directory that can be configured in the cassandra.yaml file. By default, 3 hours' worth of hints are stored, which means a failed node should come back up within 3 hours. This value is also configurable in cassandra.yaml.

hinted_handoff_enabled: true
max_hint_window_in_ms: 10800000 # 3 hours
hints_directory: /home/ubuntu/node1/data/hints
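
To make the mechanism concrete, here is a toy sketch of the hinted-handoff idea only. This is not Cassandra's implementation; the node names and mutations are made up, and real hints live on disk in the hints_directory shown above.

    # Toy sketch of the hinted-handoff idea -- not Cassandra's implementation.
    import time

    MAX_HINT_WINDOW_MS = 10_800_000          # mirrors max_hint_window_in_ms (3 hours)

    hints: dict[str, list[tuple[float, str]]] = {}   # target node -> [(created_ms, mutation)]
    down_nodes: set[str] = set()

    def write(target_node: str, mutation: str) -> None:
        """Coordinator side: apply the write, or store a hint if the replica is down."""
        if target_node in down_nodes:
            hints.setdefault(target_node, []).append((time.time() * 1000, mutation))
        else:
            print(f"applied on {target_node}: {mutation}")

    def node_came_back(target_node: str) -> None:
        """Replay hints that are still inside the hint window; older ones are dropped
        (that data would then need a repair)."""
        down_nodes.discard(target_node)
        now_ms = time.time() * 1000
        for created_ms, mutation in hints.pop(target_node, []):
            if now_ms - created_ms <= MAX_HINT_WINDOW_MS:
                print(f"replaying on {target_node}: {mutation}")

    down_nodes.add("node2")
    write("node2", "INSERT user:42")
    node_came_back("node2")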

Murmur3Partitioner calculates the hash using the partition key columns; let's make our peace with that. There are other partitioners as well, like RandomPartitioner and ByteOrderedPartitioner.

Below is a sample output of gossip info. You can go through each field in the protocol data below:

    ubuntu@ds201-node1:~$ ./node1/bin/nodetool gossipinfo
    /127.0.0.1
      generation:1621506507                 -- the time at which this node was bootstrapped
      heartbeat:2323
      STATUS:28:NORMAL,-1316314773810616606 -- status of the node: NORMAL, LEFT, LEAVING, REMOVED, REMOVING, ...
      LOAD:2295:110299.0                    -- disk space usage
      SCHEMA:64:c5c6bdbd-5916-347a-ab5b-21813ab9d135  -- changes whenever the schema changes
      DC:37:Cassandra                       -- data center of the node
      RACK:18:rack1                         -- rack within the data center
      RELEASE_VERSION:4:4.0.0.2284
      NATIVE_TRANSPORT_ADDRESS:3:127.0.0.1
      X_11_PADDING:2307:{"dse_version":"6.0.0","workloads":"Cassandra","workload":"Cassandra","active":"true","server_id":"08-00-27-32-1E-DD","graph":false,"health":0.3}
      NET_VERSION:1:256
      HOST_ID:2:ebd0627a-2491-40d8-ba37-e82a31a95688
      NATIVE_TRANSPORT_READY:66:true
      NATIVE_TRANSPORT_PORT:6:9041
      NATIVE_TRANSPORT_PORT_SSL:7:9041
      STORAGE_PORT:8:7000
      STORAGE_PORT_SSL:9:7001
      JMX_PORT:10:7199
      TOKENS:27:<hidden>

Gossip is the broadcast protocol that spreads data across the cluster. There is no master in a Cassandra cluster; peers spread state among themselves, which helps them maintain the latest information. Nodes communicate with each other randomly using the gossip protocol (there are some criteria behind this randomness). Gossip spreads node metadata only, not client data.
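
If it helps, here is a toy sketch of the general idea of versioned gossip state. It is not Cassandra's actual protocol; the field names simply mirror the gossipinfo output above.

    # Toy sketch of gossip-style metadata exchange -- not Cassandra's protocol.
    # Each node keeps versioned state about every node it knows of; when two
    # peers gossip, the entry with the higher version wins.

    GossipState = dict[str, dict[str, tuple[int, str]]]   # node -> {field: (version, value)}

    def merge(mine: GossipState, theirs: GossipState) -> None:
        """Pull any newer (higher-version) entries from the peer's view into ours."""
        for node, fields in theirs.items():
            my_fields = mine.setdefault(node, {})
            for field, (version, value) in fields.items():
                if field not in my_fields or my_fields[field][0] < version:
                    my_fields[field] = (version, value)

    node1_view: GossipState = {"node3": {"STATUS": (28, "NORMAL"), "LOAD": (2295, "110299.0")}}
    node2_view: GossipState = {"node3": {"STATUS": (30, "LEAVING"), "LOAD": (2200, "109000.0")}}

    merge(node1_view, node2_view)
    print(node1_view)   # node1 keeps its newer LOAD but picks up the newer STATUS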

Hope this clears some doubts.


Providing additional depth:

Token range assignment does have a random component to it, so it's not exactly "even." However, if you're using allocate_tokens_for_keyspace, the new token allocation algorithm takes over, and optimizes future range assignments based on the replication factor (RF) of the specified keyspace.

Here is a six-line section of abbreviated output from nodetool ring, from a 3-node cluster built with num_tokens=16. Note that the "range" is effectively the defined starting token (hash) all the way until the next starting token minus one:

10.0.0.1  -6595849996054463311
10.0.0.2  -5923426258674511018
10.0.0.2  -5194860430157391004
10.0.0.2  -4076256821118426122
10.0.0.2  -3750110785943336998
10.0.0.3  -3045824679140675270

Watch what happens when I add a 4th node:

10.0.0.1  -6595849996054463311
10.0.0.2  -5923426258674511018
10.0.0.4  -5711305561631524277
10.0.0.2  -5194860430157391004
10.0.0.4  -4831174780910733952
10.0.0.2  -4076256821118426122
10.0.0.2  -3750110785943336998
10.0.0.4  -3659290179273062522
10.0.0.3  -3045824679140675270

Notice that the starting token ranges for each of the three original nodes remain the same. But in this particular case, ranges on 10.0.0.2 were bisected with the latter half assigned to 10.0.0.4.

Note that once streaming completes, the data in those ranges now owned by 10.0.0.4 is still present on 10.0.0.2. That's by design, in case the bootstrap process for the new node fails. Once things are stable, you can get rid of that data by running nodetool cleanup on the original three nodes.

What happens when a node goes down? Does C* rebalance the ring by re-distributing the tokens and moving data around?

This happens when nodetool removenode is run. But for a node that simply goes down, "hints" are stored on the remaining nodes to be replayed once the down node comes back.