We are using Apache-Cassandra v3.0.9 and have 3 DC. We are experiencing continuous troubles while running nodetool repair and most of the time the repair process causes big outages. We have 3 different datacenters consisting of 4, 4 & 15 nodes. The total data is around 200 GB at RF=3 and we are using LCS. The RAM is 16 GB, out of which 6 GB is dedicated as heap. Most of the times we try to run full repair the repair process fails with long GC pauses and node becoming unresponsive. Other than at the time of repair our nodes are good on heap and GC pauses are hardly 300 ms. I have following doubts.
Is it still required to run full repair before
gc_grace_secondsor just the incremental repairs are good enough in apache cassandra v3.0.9Do I need to run incremental sequential repairs on every node of the cluster, any one node of each of the datacenters or just any node of the whole cluster? One-by-one or concurrently?
What are the downsides of repair failing because some nodes became unresponsive/died during the repair process, any steps to take care of before starting another repair session.
What are the downsides of not scheduling repairs at all?
We started our cassandra deployment straight away on version 3.0.9. Is the migration as mentioned on Apache Cassandra documentation still required?