1
votes

We are using Apache-Cassandra v3.0.9 and have 3 DC. We are experiencing continuous troubles while running nodetool repair and most of the time the repair process causes big outages. We have 3 different datacenters consisting of 4, 4 & 15 nodes. The total data is around 200 GB at RF=3 and we are using LCS. The RAM is 16 GB, out of which 6 GB is dedicated as heap. Most of the times we try to run full repair the repair process fails with long GC pauses and node becoming unresponsive. Other than at the time of repair our nodes are good on heap and GC pauses are hardly 300 ms. I have following doubts.

  1. Is it still required to run full repair before gc_grace_seconds or just the incremental repairs are good enough in apache cassandra v3.0.9

  2. Do I need to run incremental sequential repairs on every node of the cluster, any one node of each of the datacenters or just any node of the whole cluster? One-by-one or concurrently?

  3. What are the downsides of repair failing because some nodes became unresponsive/died during the repair process, any steps to take care of before starting another repair session.

  4. What are the downsides of not scheduling repairs at all?

  5. We started our cassandra deployment straight away on version 3.0.9. Is the migration as mentioned on Apache Cassandra documentation still required?

1

1 Answers

0
votes
  1. full repair is still needed. incremental repair would split SSTables into "repaired" and "unrepaired" parts and the "repaired" part will not be repaired later and that's why incremental repair is more efficient. However, if there're data corruption in the "repaired" sstables, only full repair can fix that; our experience is to run incremental repair every day and full repair only once per grace period. Also, when you have incremental repair, you can make the grace period longer.

  2. better run incremental repair on each node one by one; you can have a cron job or code a simple scheduler to do that;

  3. repair failure, just run it again; no side effect I know of.

  4. if you don't do repair, as time goes, you data consistency is at danger; Cassandra takes the eventual consistency concept, which means that it doesn't guarantee strong consistency when you write data to it unless you explicitly specify that. repair is very important to guarantee the data in the background are all kept up to date and consistent;

  5. if you run full repair already in your cluster, you shouldn't need to migrate explicitly.