
I am trying to understand data replication in Cassandra. In my case, I have to store a huge number of records in a single table, partitioned by a yymmddhh primary key.

I have two data centers (DC1 and DC2), and I created a keyspace using the CQL below.

CREATE KEYSPACE db1 WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'DC1' : 1, 'DC2' : 1 };

Then I created a new table tbl_data using the CQL below:

CREATE TABLE db1.tbl_data (
    yymmddhh varchar,
    other_details text,
    PRIMARY KEY (yymmddhh)
) WITH read_repair_chance = 0.0;

Now I can see that the keyspace "db1" and the table "tbl_data" were created successfully. I have a few million rows to insert, and I am assuming that every row will be stored in both data centers, i.e. DC1 and DC2, since the replication factor is 1 in each data center.

Suppose that after some time I need to add more nodes because the number of records grows into the billions, and a single data center can no longer hold that many records due to disk space limitations.

a) How can I divide the data across different nodes, and add new nodes on demand?

b) Do I need to alter the keyspace "db1" to add the names of new data centers to the replication list?

c) How will the current setup scale horizontally?

d) I am connecting to Cassandra with the Node.js driver using the code below. Do I need to list the IP addresses of all nodes in the code? If I keep adding nodes on demand, do I need to change the code every time?

var client = new cassandra.Client({ contactPoints: ['ipaddress_of_node1'], keyspace: 'db1' });

In short, my basic requirement is to store a huge number of records in a single table, spread the data across different servers, and be able to add new servers as the data volume grows.


2 Answers


a) If you add new nodes to the data center, the data will be shared between the nodes automatically. With a replication factor of 1 and default settings, two nodes should each own roughly 50% of the data, though it may take a while to redistribute data after a new node joins. 'nodetool status <keyspace>' shows how much of that keyspace each node owns.
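
For example, assuming the keyspace is named db1 as above, this would show per-node ownership for that keyspace (the exact output columns vary a little between Cassandra versions):

nodetool status db1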

b) Yes, I believe you have to (though I am not 100% sure about this).

c) With your setup it will scale linearly (assuming the machines are identical and have the same num_tokens value): each node holds roughly 1/N of the data (1 node = 100%, 2 = 50%, 3 = 33%, etc.), and both throughput and storage capacity grow with the number of nodes.

d) No. Assuming the Node.js driver works like the C++ and Python Cassandra drivers (it should!), once it connects to the cluster it becomes aware of the other nodes automatically.


The answer by dbrats covers most of your concerns.

Do I need to alter keyspace "db1" to put name of new data centers in the list?

Not needed. You only need to alter the keyspace if you add a new data center or change the replication factor.
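
For illustration, this is roughly what the statement would look like if you added a hypothetical third data center named DC3 (DC3 is an assumed name, not part of your current setup):

ALTER KEYSPACE db1 WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'DC1' : 1, 'DC2' : 1, 'DC3' : 1 };

After altering, you would typically run nodetool rebuild on the nodes of the new data center so that existing data is streamed to it.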

Do I need to put ip address of all nodes here in code?

Not needed. But adding more than one contact point ensures higher availability: if one contact point is down, the driver can connect through another. Once connected, it retrieves the full list of nodes in the cluster.
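
As a sketch, your connection code could simply list a second node as an extra contact point (ipaddress_of_node2 is a placeholder for any other node you already know about):

var client = new cassandra.Client({
    contactPoints: ['ipaddress_of_node1', 'ipaddress_of_node2'],
    keyspace: 'db1'
});

The driver discovers the remaining nodes from whichever contact point it reaches first, so you do not need to update this list every time you add a node.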