0
votes

I am facing a disk space issue with Cassandra. One of my keyspaces is taking almost 25 GB of space. Since its main table contained a huge amount of data, I started cleaning it up, and out of 100 million records I deleted 98 million. In spite of the clean-up, Cassandra is still using 25 GB.

To make sure that Cassandra occupies disk space only for the remaining 2 million records, I am trying to implement the following approach (I have a cluster of 5 Cassandra nodes, with the replication factor set to 3 for all keyspaces):

  1. Add a 6th node to the existing cluster and shut down one of the existing nodes (say, the first node). Here I expect the data from the 1st node to be copied to the newly added node, since the replication factor is 3 and one of the replicas has gone down.
  2. After some time (allowing for the data to finish streaming to the new node), repeat the step above for 4 more new nodes, so the cluster ends up with 5 new Cassandra nodes holding the data replicated from the old ones.
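The rotation described above can be sketched with the standard admin commands; note that simply shutting a node down does not re-replicate its data, so the sketch below uses `nodetool decommission`, which does stream the node's ranges to the remaining replicas. Node names and the keyspace name are assumptions for illustration:

```shell
# Hypothetical sketch of rotating a new node into the cluster.
# (node names, keyspace name "my_keyspace" are assumptions)

# 1. Bootstrap the new node (on node6, after pointing cassandra.yaml
#    at the cluster name and seed list):
sudo service cassandra start

# 2. From any node, wait until node6 shows as UN (Up/Normal):
nodetool status my_keyspace

# 3. Retire the old node. Run this ON node1 itself; unlike a plain
#    shutdown, decommission streams node1's data to the other replicas:
nodetool decommission
```

Repeating this five times would swap out the whole cluster, but as the answer below notes, it may not actually reclaim the space you expect.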

Is this the correct approach to solve my problem? If it won't work or isn't a good solution, I would like to understand why, and to learn about any alternative approach that is safe.

NOTE: I am using Cassandra 2.1.14

1
Did you try to run nodetool compact? – Adpi2
No I haven't. Does it help? Somewhere I read that we should not run "nodetool compact" manually. – Shailesh

1 Answer

0
votes

You haven't provided enough information to really know what's going on, but some things to think about...

  • In order to provide eventual consistency in the face of failure, Cassandra can't delete data immediately. It must first write NEW data, called a tombstone, and then wait gc_grace_seconds before the tombstone is allowed to be purged in the next compaction. What you haven't talked about is reasoning through the impact of gc_grace_seconds on your tombstones. If your tombstones aren't old enough to be purged, neither node replacement nor compaction will help you until gc_grace_seconds has passed (or you temporarily lower gc_grace_seconds during the maintenance, but that runs the risk of accidentally resurrecting deleted data if you experience a node outage during the maintenance window).
  • Provided you have sorted out gc_grace_seconds vs. tombstone age, a manual compaction will recover your disk space. If you're using size-tiered compaction, it will also squish all your data into a single SSTable... which may then not be compacted again for a very long time, leading to more space-recovery issues down the road if you update or delete your data.
  • Switching to leveled compaction can help with space-recovery issues. It uses more, smaller SSTables and guarantees that no more than a certain percentage of space is occupied by old updates or reclaimable tombstones. Leveled compaction is more demanding on your disks, though; if you run your cluster "hot" in terms of write capacity, the switch may affect performance.
  • I think node replacement is also a viable strategy for reclaiming disk, but I don't remember the details of streaming well enough to know for sure whether it pulls over stale tombstones or compacts them first... I THINK it compacts first. You might want to verify on a test bed first, though.
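Putting the first three points together, the tombstone-aware cleanup could look roughly like the following. This is a hedged sketch, not a verified procedure: the keyspace and table names (my_keyspace, big_table) are assumptions, and lowering gc_grace_seconds carries the resurrection risk described above:

```shell
# Hypothetical names: my_keyspace / big_table.

# 1. Check the table's current gc_grace_seconds (default 864000 = 10 days):
cqlsh -e "DESCRIBE TABLE my_keyspace.big_table;" | grep gc_grace_seconds

# 2. (Optional, risky) lower it for the maintenance window so existing
#    tombstones become purgeable sooner. Only do this while ALL nodes are up:
cqlsh -e "ALTER TABLE my_keyspace.big_table WITH gc_grace_seconds = 3600;"

# 3. Force a major compaction on that table, on each node, to purge
#    tombstones that are older than gc_grace_seconds:
nodetool compact my_keyspace big_table

# 4. Restore the original value afterwards:
cqlsh -e "ALTER TABLE my_keyspace.big_table WITH gc_grace_seconds = 864000;"

# Alternatively, switch the table to leveled compaction to keep reclaimable
# space bounded going forward:
cqlsh -e "ALTER TABLE my_keyspace.big_table
          WITH compaction = {'class': 'LeveledCompactionStrategy'};"
```

Remember the major-compaction caveat from the second bullet: with size-tiered compaction this leaves one very large SSTable behind, so weigh that against switching compaction strategies.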