
On a six-node Cassandra cluster [replication factor 2], we notice a single node being hotspotted [heavy load]. Looking at tpstats, I could see that the FlushWriter and ReplicateOnWriteStage stages both had a non-zero "All time blocked" count.

We have only one data directory [hence we have configured Cassandra to use only one flush writer], and the memtable flush queue size is 2.

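Heavily loaded node:
Pool Name                Active   Pending   Completed   Blocked   All time blocked
ReplicateOnWriteStage        32      4128      599249        48             371304
FlushWriter                   0         0          85         0                 24

Normal node:
Pool Name                Active   Pending   Completed   Blocked   All time blocked
ReplicateOnWriteStage         0         0      753665         0                  0
FlushWriter                   0         0         137         0                 25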

The configuration of all the nodes is exactly the same, and we use the Murmur3Partitioner.

Are there other stats I could look at to track down the CPU load issue and the blocked ReplicateOnWriteStage on this single node?

Are the counters in tpstats historical (cumulative) counters, or do they refresh every N minutes?

It is mentioned here that blocking can happen either because IO is not keeping up or because of huge rows and sorting [which increases CPU load]. Could the latter be the reason for the unusual load on one node out of the entire cluster?

To be precise, tpstats alone is not enough to diagnose this. Can you capture the netstats and compactionstats output while this is occurring? Also, do you see any CF flushing frequently? - Ananth
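One way to tell whether the blocking is still happening, rather than left over from an earlier burst, is to take two tpstats snapshots a few minutes apart and compare them. The sketch below assumes the usual column order of `nodetool tpstats` (pool name, Active, Pending, Completed, Blocked, All time blocked) and assumes the counters are cumulative since the node started, which is my understanding but worth verifying on your version. The function names are made up for illustration, and the second snapshot is invented sample data.

```python
import re

# Assumed row layout: name, Active, Pending, Completed, Blocked, All time blocked.
# Header lines and the two-column "dropped messages" section do not match.
ROW = re.compile(r"^(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)$")
COLUMNS = ("active", "pending", "completed", "blocked", "all_time_blocked")


def parse_tpstats(text):
    """Return {pool_name: {column: value}} for every thread pool row."""
    stats = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            values = [int(v) for v in match.groups()[1:]]
            stats[match.group(1)] = dict(zip(COLUMNS, values))
    return stats


def blocked_growth(before, after):
    """Return pools whose 'all time blocked' count grew between snapshots."""
    growth = {}
    for name, now in after.items():
        prev = before.get(name)
        if prev is None:
            continue
        delta = now["all_time_blocked"] - prev["all_time_blocked"]
        if delta > 0:
            growth[name] = delta
    return growth


# First snapshot: the loaded node's numbers from the question.
snapshot_1 = """
ReplicateOnWriteStage  32  4128  599249  48  371304
FlushWriter             0     0      85   0      24
"""
# Second snapshot: hypothetical values, only to show the comparison.
snapshot_2 = """
ReplicateOnWriteStage  32  4200  601000  48  371900
FlushWriter             0     0      86   0      24
"""

growth = blocked_growth(parse_tpstats(snapshot_1), parse_tpstats(snapshot_2))
print(growth)
```

If a pool keeps showing up in the result while the node is under load, the blocking is ongoing, and that is the moment to capture netstats and compactionstats as well.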

1 Answer


Increasing your heap size should be the solution. If your logs show long GC times, GC pauses could be the culprit.
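A rough way to check for long GC pauses is to scan system.log for the GCInspector entries. The sketch below assumes the older log wording, "GC for <collector>: <N> ms ...", which newer Cassandra versions phrase differently, so adjust the pattern to whatever your log actually contains. The 200 ms threshold, the function name, and the sample lines are all illustrative assumptions, not values from your cluster.

```python
import re

# Assumed GCInspector wording in older Cassandra logs; adapt if yours differs.
GC_LINE = re.compile(r"GC for (\w+): (\d+) ms")


def long_gc_pauses(lines, threshold_ms=200):
    """Return (collector, pause_ms) for GC entries at or above the threshold."""
    pauses = []
    for line in lines:
        match = GC_LINE.search(line)
        if match and int(match.group(2)) >= threshold_ms:
            pauses.append((match.group(1), int(match.group(2))))
    return pauses


# Made-up sample lines, only to show the expected shape of the input.
sample = [
    "INFO GCInspector.java GC for ParNew: 35 ms for 1 collections",
    "INFO GCInspector.java GC for ConcurrentMarkSweep: 4210 ms for 2 collections",
    "INFO CompactionTask.java Compacted 4 sstables",
]
print(long_gc_pauses(sample))

# With a real log (the path is an assumption, check your installation):
# with open("/var/log/cassandra/system.log") as log:
#     print(long_gc_pauses(log))
```

Frequent multi-second ConcurrentMarkSweep entries on the loaded node, and not on the others, would support the GC explanation.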

Could you also post your logs so that we can find a better solution?