I am trying to understand data replication in Cassandra. In my case, I have to store a huge number of records in a single table that is partitioned by a yymmddhh primary key.
I have two data centers (DC1 and DC2), and I created a keyspace with the following CQL:
CREATE KEYSPACE db1 WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'DC1' : 1, 'DC2' : 1 };
Then I created a new table, tbl_data, with the following CQL:
CREATE TABLE db1.tbl_data (
    yymmddhh varchar,
    other_details text,
    PRIMARY KEY (yymmddhh)
) WITH read_repair_chance = 0.0;
Now I can see that the keyspace "db1" and the table "tbl_data" were created successfully. I have a few million rows to insert, and I am assuming that every row will be stored in both data centers, i.e. DC1 and DC2, since the replication factor is 1 for each of them.
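If it helps, I believe the replication settings can be checked with something like the query below (I am assuming Cassandra 3.x, where the schema tables live in system_schema; I think older versions expose this in system.schema_keyspaces instead, but I am not sure):

-- check which data centers the keyspace db1 replicates to
SELECT keyspace_name, replication FROM system_schema.keyspaces WHERE keyspace_name = 'db1';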
Suppose that after some time I need to add more nodes, because the number of records can grow to billions and a single data center cannot hold that many records due to disk space limitations.
a) How can I divide the data across different nodes, and how can I add new nodes on demand?
b) Do I need to alter the keyspace "db1" to add the names of new data centers to the replication list? (See my guess in the ALTER KEYSPACE sketch after these questions.)
c) How will the current setup scale horizontally?
d) I am connecting to Cassandra with the Node.js driver using the code below. Do I need to put the IP addresses of all nodes in the code? And if I keep adding nodes on demand, do I have to change the code every time?
var cassandra = require('cassandra-driver');
var client = new cassandra.Client({ contactPoints: ['ipaddress_of_node1'], keyspace: 'db1' });
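Regarding question b), my guess is that adding a data center would mean running something like the statement below (DC3 is just a hypothetical name for the new data center), but I am not sure whether this is the right approach, or whether anything else, such as running nodetool rebuild on the new nodes, is also required:

-- hypothetical: add a third data center DC3 with one replica of the data
ALTER KEYSPACE db1 WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'DC1' : 1, 'DC2' : 1, 'DC3' : 1 };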
From all of the above you can see that my basic requirement is to store a huge number of records in a single table, spread the data across different servers, and be able to add new servers when the data volume increases.