Currently I am bulk loading 30 TB of data into a ten-node cluster running Cassandra 2.1.2. I load from flat files in stages of ~5 TB using 'sstableloader'.
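For reference, each stage is streamed in roughly like this (the keyspace/table names, directory path, and seed node are placeholders; the '-t' throttle is optional):

    # Load one ~5 TB stage from a directory of generated SSTables.
    # 'mykeyspace', 'mytable', the staging path and 10.0.0.1 are placeholders.
    # -d names a contact node; -t throttles streaming (Mbit/s) to leave headroom.
    sstableloader -d 10.0.0.1 -t 100 /bulk/staging/mykeyspace/mytable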
I am aware that it is required to run 'nodetool repair' periodically on each Cassandra node. But currently (at the 10 TB mark) repairing each node takes 48+ hours, and there is pressure to complete the bulk load. So which repair strategy is best (the per-node repair invocation I have in mind is sketched after the list below):
- To run 'nodetool repair' on each node in turn between each 5 TB stage?
- To bulk load all 30 TB and only then start repairing?
- To repair nodes while 'sstableloader' is running?
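The per-node repair I am running is roughly this ('mykeyspace' and the host names are placeholders; '-pr' limits each run to that node's primary ranges so the same data is not repaired again from every replica):

    # Repair only the primary ranges of each node in turn.
    # Host names and 'mykeyspace' are placeholders for my cluster.
    for host in node1 node2 node3 node4 node5 node6 node7 node8 node9 node10; do
        nodetool -h "$host" repair -pr mykeyspace
    done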
Ideally, I would like a tool that measures how much repair is actually needed, i.e. a measure of the entropy between replicas. Does such a thing exist?