
We have a three-node cluster running in AWS. The nodes are located in different AZs for availability. All nodes are in the same VPC and in the same security groups, which allow all traffic between the nodes. The snitch is set to Ec2Snitch. The Cassandra version is 3.2.1.

What might be the reason for hints being created every ten seconds on some of the nodes even though all nodes are up and running? The system.log is practically flooded with messages like the one below. However, no related warnings or errors can be found in system.log. The amount of data written to the cluster is currently very modest, and the load is very low.

The issue came to light because version 3.2.1 does not correctly delete the crc32 files related to the hints, and we ran out of inodes on our file system.

INFO  [HintsDispatcher:2] 2017-08-02 13:13:42,765 HintsDispatchExecutor.java:252 - Finished hinted handoff of file 4c3e3e47-fcc2-4bff-a3a7-e2560f024173-1501679605217-1.hints to endpoint 4c3e3e47-fcc2-4bff-a3a7-e2560f024173
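One way to quantify how often this happens, and towards which node, is to count these messages per endpoint. The following is a minimal sketch, not a definitive tool: the pattern is modelled only on the single log line quoted above, and the function name `count_handoffs` is made up for illustration.

```python
import re
from collections import Counter

# Pattern modelled on the one log line quoted in the question; other
# Cassandra versions may word this message differently (assumption).
HANDOFF_RE = re.compile(
    r"Finished hinted handoff of file (?P<file>\S+\.hints) "
    r"to endpoint (?P<endpoint>[0-9a-f-]+)"
)


def count_handoffs(lines):
    """Return a Counter mapping endpoint host ID -> finished handoffs seen."""
    counts = Counter()
    for line in lines:
        match = HANDOFF_RE.search(line)
        if match:
            counts[match.group("endpoint")] += 1
    return counts
```

Feeding it the lines of each node's system.log (for example via `open(path)`) and comparing the counters should show whether the hints all target one node or are spread across the cluster.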

Any ideas for further investigation of the root cause?

Something to note, unrelated to the question: 3.2.1 is buggy and far from stable (it was very early in the 3.x line). You should really upgrade to the 3.11.x branch. - Chris Lohfink

1 Answer


A good place to start looking is the GC logs, since GC pauses are the most likely cause of regular, periodic dropped mutations and hints. A GC pause longer than the write timeout (or close to it) will likely cause them. The cause of the GCs is harder to determine, but common causes include many tombstones, very wide (>100 MB) partitions, or too many SSTables from compactions falling behind (you can check with nodetool cfstats and compactionstats). You can start by just giving the JVM more heap space and seeing if it improves. Other solutions depend on the cause.
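As a rough sketch of that first check, the snippet below scans log lines for GC pauses at or above a threshold. It rests on two assumptions: that GC pauses are reported by GCInspector in system.log in the form `... GCInspector.java:284 - ParNew GC in 245ms. ...`, and that the write timeout is 2000 ms (the usual default for `write_request_timeout_in_ms`; check your own cassandra.yaml). The function name `long_gc_pauses` is made up for illustration.

```python
import re

# Assumed shape of a GCInspector line in system.log, e.g.
#   "... GCInspector.java:284 - G1 Young Generation GC in 2450ms.  ..."
GC_RE = re.compile(
    r"GCInspector\.java:\d+ - (?P<collector>.+?) GC in (?P<ms>\d+)ms"
)


def long_gc_pauses(lines, threshold_ms=2000):
    """Return (collector, pause_ms) for each logged GC >= threshold_ms.

    threshold_ms defaults to 2000, an assumed write timeout; pass a lower
    value to also catch pauses that are merely close to the timeout.
    """
    pauses = []
    for line in lines:
        match = GC_RE.search(line)
        if match and int(match.group("ms")) >= threshold_ms:
            pauses.append((match.group("collector"), int(match.group("ms"))))
    return pauses
```

If the long pauses line up in time with the hint messages on the other nodes, GC on that node is a likely culprit.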

You can also check tpstats for dropped mutations. A dropped mutation causes the coordinator to write a hint, which is delivered immediately if the node is UP. It won't tell you the cause, but it might identify the nodes dropping more of them, which you can then look into further (CPU load? disk? exceptions in the logs?).
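To compare nodes, the dropped counts can be pulled out of the tpstats text. This is only a sketch: it assumes the output ends with a two-column table headed `Message type    Dropped` (one row per message type, such as `MUTATION` and `HINT`), which may differ between versions, and `dropped_messages` is a made-up helper name.

```python
def dropped_messages(tpstats_output):
    """Parse the dropped-message table from `nodetool tpstats` text.

    Assumes a header line starting with "Message type" followed by
    rows of the form "<NAME> <count>" (layout is an assumption).
    """
    dropped = {}
    in_table = False
    for line in tpstats_output.splitlines():
        fields = line.split()
        if fields[:2] == ["Message", "type"]:
            in_table = True  # rows after this header are message types
            continue
        if in_table and len(fields) == 2 and fields[1].isdigit():
            dropped[fields[0]] = int(fields[1])
    return dropped
```

Run `nodetool tpstats` on each node, pass the captured text to the function, and compare the `MUTATION` value across nodes. The counts are cumulative, so taking two samples some minutes apart shows whether drops are still occurring.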