Currently I am bulk loading 30 TB of data into a ten-node cluster running Cassandra 2.1.2. I load from flat files in stages of ~5 TB using 'sstableloader'.
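For reference, each stage is streamed in roughly like this (the keyspace/table names, directory path, and seed node are placeholders; the '-t' throttle is optional):

    # Load one ~5 TB stage from a directory of generated SSTables.
    # 'mykeyspace', 'mytable', the staging path and 10.0.0.1 are placeholders.
    # -d names a contact node; -t throttles streaming (Mbit/s) to leave headroom.
    sstableloader -d 10.0.0.1 -t 100 /bulk/staging/mykeyspace/mytable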
I am aware that it is required to run 'nodetool repair' periodically on each Cassandra node. But currently (at the 10 TB mark) repairing each node takes 48+ hours, and there is pressure to complete the bulk load. So which repair strategy is best (the per-node repair invocation I have in mind is sketched after the list below):
- To run 'nodetool repair' on each node in turn between each 5 TB stage?
- To bulk load all 30 TB and only then start repairing?
- To repair nodes while 'sstableloader' is running?
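The per-node repair I am running is roughly this ('mykeyspace' and the host names are placeholders; '-pr' limits each run to that node's primary ranges so the same data is not repaired again from every replica):

    # Repair only the primary ranges of each node in turn.
    # Host names and 'mykeyspace' are placeholders for my cluster.
    for host in node1 node2 node3 node4 node5 node6 node7 node8 node9 node10; do
        nodetool -h "$host" repair -pr mykeyspace
    done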
Ideally, I would like a tool that measures how much repair is actually needed, i.e. a measure of the entropy between replicas. Does such a thing exist?