timeouts on ReadRepairStage error messages

Question

We are using Apache Cassandra 3.11.4 .Recently we are seeing overloaded readrepair ERROR messages in the entire cluster because that we are getting timeouts ..I'm not able to find the root cause for this . Appreciate any inputs on this issue ..

ERROR [ReadRepairStage:2537] 2019-07-18 17:08:15,119 CassandraDaemon.java:228 - Exception in thread Thread[ReadRepairStage:2537,5,main] org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 1 responses. at org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:202) ~[apache-cassandra-3.11.3.jar:3.11.3] at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:175) ~[apache-cassandra-3.11.3.jar:3.11.3] at org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:92) ~[apache-cassandra-3.11.3.jar:3.11.3] at org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:79) ~[apache-cassandra-3.11.3.jar:3.11.3] at org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50) ~[apache-cassandra-3.11.3.jar:3.11.3] at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.11.3.jar:3.11.3] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_212] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_212] at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) ~[apache-cassandra-3.11.3.jar:3.11.3] at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_212]

reduced dclocalreadrepair to 0.0

Carlos Monroy Nieblas Carlos Monroy Nieblas · Accepted Answer · 2019-07-19T00:51:46

Timeouts are a common issue while attempting repairs, and without more specifics of the errors, or your configuration, this will be a shot in the dark.

Repairs depend on disk space, as it will create temporary copies of files, as a rule of thumb the disk utilization should be lower than or equal to 50% to ensure that you'll have enough space.
Repairs can be delayed/aborted if the cluster is stressed, if that is the case, you may need to scale up the cluster to increase the available resources.
You may want to take a look in these other recommendations from Aaron regarding updates of the JVM settings in repairs.

Also note that since Cassandra 3.11.3, the settings read_repair_chance and dc_read_repair_chance were removed, as their names were misleading with the result obtained. Adding them won't have any effect.

timeouts on ReadRepairStage error messages

1 Answers