I am facing a disk space issue with Cassandra. One of the keyspaces is taking almost 25 GB of space. Because the table in it held a huge amount of data, I started cleaning it up and deleted 98 million of its 100 million records. Despite the clean-up, Cassandra is still using 25 GB.
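As a rough back-of-the-envelope estimate of what I expected to see after the clean-up, here is a small sketch. It assumes all rows are roughly the same size, which I have not measured; the function name `expected_size_gb` is just for illustration.

```python
# Back-of-the-envelope estimate; assumes all rows are roughly the same size.
total_rows = 100_000_000      # rows before the clean-up
deleted_rows = 98_000_000     # rows removed by the clean-up
current_size_gb = 25.0        # space the keyspace currently occupies on disk


def expected_size_gb(total, deleted, size_gb):
    """Scale the current size by the fraction of rows that remain."""
    remaining_fraction = (total - deleted) / total
    return size_gb * remaining_fraction


estimate = expected_size_gb(total_rows, deleted_rows, current_size_gb)
print(f"Expected size for the remaining rows: ~{estimate:.1f} GB")
```

So under that assumption I would expect roughly 0.5 GB for the remaining 2 million records, not 25 GB.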
To make sure that Cassandra only occupies disk space for the remaining 2 million records, I am trying to implement the following approach (I have a cluster of 5 Cassandra nodes, with the replication factor set to 3 for all keyspaces):
- Add a 6th node to the existing cluster and shut down one of the existing nodes (say, the first node). Here I am expecting that the data from the 1st node will be copied to the newly added node, since the replication factor is set to 3 and one of the replicas is now down.
- After some time (copying data to the new Cassandra node will take a while), repeat the above steps for the remaining 4 nodes, so that my cluster ends up with 5 new Cassandra nodes holding the data replicated from the old Cassandra nodes.
Is this the correct approach to solve my problem? If it won't work or is not a good solution, I would like to understand why, and to know of any alternative approach that is safe.
NOTE: I am using Cassandra 2.1.14