After Reaper failed to run repair on 18 nodes of Cassandra cluster, I ran a full repair of each node to fix the failed repair issue, after the full repair, Reaper executed successfully, but after a few days again the Reaper failed to run, I can see the following error in system.log
ERROR [RMI TCP Connection(33673)-10.196.83.241] 2021-09-01 09:01:18,005 RepairRunnable.java:276 - Repair session 81540931-0b20-11ec-a7fa-8d6977dd3c87 for range [(-606604147644314041,-98440495518284645], (-3131564913406859309,-3010160047914391044]] failed with error Terminate session is called
java.io.IOException: Terminate session is called
at org.apache.cassandra.service.ActiveRepairService.terminateSessions(ActiveRepairService.java:191) ~[apache-cassandra-3.11.0.jar:3.11.0]
INFO [Native-Transport-Requests-2] 2021-09-01 09:02:52,020 Message.java:619 - Unexpected exception during request; channel = [id: 0x1e99a957, L:/10.196.18.230:9042 ! R:/10.254.252.33:62100]
io.netty.channel.unix.Errors$NativeIoException: readAddress() failed: Connection timed out
in nodetool tpstats
I can see some pending tasks
Pool Name Active Pending
ReadStage 0 0
Repair#18 3 90
ValidationExecutor 3 3
Also in nodetool compactionstats
there are 4 pending tasks:
-bash-4.2$ nodetool compactionstats
pending tasks: 4
- Main.visit: 1
- Main.post: 1
- Main.stream: 2
My question is why even after a full repair, reaper is still failing? and what is the root cause of pending repair?
PS: version of Reaper is 2.2.3, not sure if it is a bug in Reaper!