1
votes

I have a 5-node Cassandra cluster with 9 TB of data and am planning to add 5 more nodes to it. After the new nodes are added, load balancing will start and a subrange of partition keys will be mapped to the new nodes. When exactly should I run nodetool cleanup? Will running nodetool cleanup immediately after starting the new nodes remove the older data belonging to that subrange from the old nodes in the cluster?


1 Answer

2
votes

The DataStax doc "Adding nodes to an existing cluster" mentions this:

  1. Start Cassandra on each new node. Allow two minutes between node initializations. You can monitor the startup and data streaming process using nodetool netstats.

  2. After all new nodes are running, run nodetool cleanup on each of the previously existing nodes to remove the keys no longer belonging to those nodes. Wait for cleanup to complete on one node before doing the next. Cleanup may be safely postponed for low-usage hours.

That would seem to indicate that you should run nodetool cleanup only once all of the new nodes are up, running, and fully bootstrapped, rather than immediately after starting them. Cleanup then removes from each old node the data in the subranges that node no longer owns. As indicated, make sure to run nodetool cleanup on each old node, one node at a time.
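
The order of operations can be sketched in a small Python wrapper around nodetool. This is only an illustration under some assumptions: the helper names (`all_nodes_up_normal`, `cleanup_command`, `run_cleanup_sequentially`) and the host addresses are hypothetical, nodetool is assumed to be on the PATH with JMX reachable on each host via `-h`, and the status check simply assumes that node lines in `nodetool status` output begin with a two-letter code such as `UN` (Up/Normal) or `UJ` (Up/Joining).

```python
import re
import subprocess
from typing import Callable, List, Sequence

# Two-letter code at the start of each node line in `nodetool status`:
# first letter Up/Down, second letter Normal/Leaving/Joining/Moving.
NODE_STATE = re.compile(r"^[UD][NLJM]$")


def all_nodes_up_normal(status_output: str) -> bool:
    """Return True only if every node line in the status text reads 'UN'."""
    states = []
    for line in status_output.splitlines():
        parts = line.split()
        if parts and NODE_STATE.match(parts[0]):
            states.append(parts[0])
    # No node lines at all is treated as "not ready".
    return bool(states) and all(state == "UN" for state in states)


def cleanup_command(host: str) -> List[str]:
    """Build the nodetool invocation that runs cleanup against one node."""
    return ["nodetool", "-h", host, "cleanup"]


def run_cleanup_sequentially(
    old_nodes: Sequence[str],
    run: Callable[..., object] = subprocess.run,
) -> List[str]:
    """Run cleanup on each pre-existing node, one at a time.

    The default runner, subprocess.run, blocks until the command exits,
    and check=True raises if nodetool returns a non-zero exit code, so
    the loop does not move on to the next node after a failure.
    """
    finished = []
    for host in old_nodes:
        run(cleanup_command(host), check=True)
        finished.append(host)
    return finished


if __name__ == "__main__":
    # Hypothetical addresses of the five original nodes.
    old_nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4", "10.0.0.5"]
    status = subprocess.run(
        ["nodetool", "status"], capture_output=True, text=True, check=True
    ).stdout
    if all_nodes_up_normal(status):
        run_cleanup_sequentially(old_nodes)
    else:
        print("Some nodes are still joining or down; postponing cleanup.")
```

Passing the runner as a parameter keeps the loop easy to dry-run: a stand-in function that only records the commands shows what would be executed without touching the cluster.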