We are suddenly observing high write latency in metrics for one table (devices).

This is a tiny table with fewer than 100 rows, in which we regularly update a single field.

This is a 3-node cluster with RF=3; each node has 8 GB of RAM. We are running Cassandra 3.11.4 in Docker.

There is nothing unusual in the logs, and the application is running smoothly as well.

nodetool tablehistograms

Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                              (micros)          (micros)           (bytes)                  
50%             0.00            263.21              0.00               258                17
75%             0.00           1131.75              0.00               372                20
95%             0.00          12108.97              0.00               642                29
98%             0.00          25109.16              0.00               642                35
99%             0.00          43388.63              0.00               642                35
Min             0.00              8.24              0.00                51                 0
Max             0.00         155469.30              0.00               770                35
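
Since nodetool tablehistograms only reports the node it runs on, the same histogram can be pulled from every node and compared. A minimal sketch, assuming SSH access to the hosts and that the Cassandra container is simply named cassandra (both are assumptions about our setup):

# Compare the devices histogram across all three nodes.
# Host list and container name are placeholders.
for host in 10.164.0.23 10.164.0.24 10.164.0.58; do
  echo "== $host =="
  ssh "$host" docker exec cassandra nodetool tablehistograms iot_data devices
done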

nodetool status

Datacenter: datacenter-prod
===========================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens       Owns (effective)  Host ID                               Rack
UN  10.164.0.23  2.62 GiB   256          100.0%            e7e2a38a-d4f3-4758-a345-73fcffe26035  rack1
UN  10.164.0.24  2.61 GiB   256          100.0%            0c18b8e4-5ca2-4fb5-9e8c-663b74909fbb  rack1
UN  10.164.0.58  2.62 GiB   256          100.0%            547c0746-72a8-4fec-812a-8b926d2426ae  rack1

What is going on? Are the stats misleading, or is a real issue building up?

EDIT: I was able to narrow the issue down to one of the nodes. The exporter on node2 is showing:

cassandra_stats{cluster="Prod Cluster 2",datacenter="datacenter-prod",keyspace="iot_data",table="devices",name="org:apache:cassandra:metrics:table:iot_data:devices:writelatency:99thpercentile",} 268650.95

While node1 and node3 look like this:

cassandra_stats{cluster="Prod Cluster 2",datacenter="datacenter-prod",keyspace="iot_data",table="devices",name="org:apache:cassandra:metrics:table:iot_data:devices:writelatency:99thpercentile",} 10090.808

But I still don't know what is causing this on node2. It is under no noticeable load, and memory usage is fine as well. Any ideas?
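
For reference, this is how I compared the nodes: a quick sketch that scrapes each node's exporter endpoint directly, assuming it is reachable over plain HTTP (the port 9500 is a placeholder for whatever your exporter actually listens on):

# Pull the 99th-percentile write latency for iot_data.devices from every node.
# Port 9500 is a placeholder; adjust it to your exporter's listen port.
for host in 10.164.0.23 10.164.0.24 10.164.0.58; do
  echo "== $host =="
  curl -s "http://$host:9500/metrics" | grep 'devices:writelatency:99thpercentile'
done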

1 Answer

Solved:

We have RabbitMQ running on the affected node. Yesterday we increased the read concurrency, which resulted in load peaks whenever bursts of data came in. These bursts are not visible in the overall metrics, but during those short moments CPU load went to 100%, which affected the Cassandra writes on this node.
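
If you want to catch bursts like this yourself, sampling at one-second resolution makes them visible even though minute-level dashboards average them away. A minimal sketch using the sysstat tools; the process names (beam.smp for RabbitMQ, java for Cassandra) are assumptions about what actually runs on the node:

# Whole-box CPU at one-second resolution for ten minutes; short 100% spikes
# show up here even when they disappear in minute-level averages.
mpstat -P ALL 1 600 > cpu_samples.log

# Per-process view of the RabbitMQ and Cassandra processes over the same window.
pidstat -u -C 'beam.smp|java' 1 600 > proc_samples.log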