On a six-node Cassandra cluster [replication factor 2], we noticed that a single node is hotspotted [under heavy load]. Looking at tpstats, I could see that the FlushWriter and ReplicateOnWrite stages on that node have a non-zero "All time blocked" count.
We have only one data directory [hence we have configured Cassandra to use only one flush writer], and the memtable flush queue size is 2.
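Heavily loaded node:
Pool Name                 Active  Pending  Completed  Blocked  All time blocked
Replicate-on-write-stage      32     4128     599249       48            371304
Flush-writer                   0        0         85        0                24

Normal node:
Pool Name                 Active  Pending  Completed  Blocked  All time blocked
ReplicateOnWriteStage          0        0     753665        0                 0
FlushWriter                    0        0        137        0                25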
The configuration of all the nodes is exactly the same, and we use the Murmur partitioner.
Are there other stats I could refer to in order to track down the CPU load issue and the blocked ReplicateOnWrite stage on this single node?
Are the counters in tpstats historical [cumulative], or do they refresh every N minutes?
It is mentioned here that blocking can happen either because IO is not keeping up or because of huge rows and sorting [which increases CPU load]. Could the latter be the reason for the unusual load on this one node out of the entire cluster?
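
To tell whether the blocked counts are still growing or are just left over from an earlier spike, two tpstats snapshots taken a few minutes apart can be compared. Below is a minimal Python sketch of that idea. It assumes each pool line has the form "name active pending completed blocked all-time-blocked", as in the output above; the function names parse_tpstats and blocked_delta are hypothetical helpers, not part of Cassandra or nodetool.

```python
import re

# One pool per line: a name followed by five integer columns
# (active, pending, completed, blocked, all time blocked).
ROW = re.compile(r"^(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)$")
FIELDS = ("active", "pending", "completed", "blocked", "all_time_blocked")


def parse_tpstats(text):
    """Parse pasted tpstats text into {pool_name: {field: int}}.

    Lines that do not match the assumed layout (headers, blanks) are skipped.
    """
    pools = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            values = [int(v) for v in match.groups()[1:]]
            pools[match.group(1)] = dict(zip(FIELDS, values))
    return pools


def blocked_delta(before, after):
    """Return the change in 'all time blocked' per pool between two snapshots."""
    return {
        name: after[name]["all_time_blocked"] - before[name]["all_time_blocked"]
        for name in after
        if name in before
    }
```

Feeding the text of two snapshots through parse_tpstats and then blocked_delta gives one number per pool. A delta that stays at zero would suggest the blocking happened in the past, while a steadily growing delta would suggest it is ongoing.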