
We recently deployed microservices into our production environment, and these microservices communicate with Cassandra nodes for reads and writes.

After the deployment, we started noticing a sudden drop in CPU to 0 on all Cassandra nodes in the primary DC, at least once per day. Each time this happens, two random nodes (in the SAME DC) are unable to reach each other according to "nodetool describecluster", and "nodetool tpstats" on those two nodes shows a high number of ACTIVE Native-Transport-Requests, between 100 and 200. The two nodes are also storing HINTS for each other, yet longer pings between them show no packet loss. Restarting those two Cassandra nodes fixes the issue at that moment. This has been happening for two weeks.

We use Apache Cassandra 2.2.8.

The microservice logs also show read/write timeouts before the sudden CPU drop on all Cassandra nodes.


2 Answers


You might be using a token-aware load balancing policy on the client and updating a single partition or token range heavily, in which case all of the coordination load is focused on a single replica set. Change your application to use a RoundRobin (or DC-aware round robin) LoadBalancingPolicy and it will likely resolve the issue. If it does, you have a hotspot in your application and you should give some attention to your data model.
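If the client happens to be the DataStax Java driver 3.x (a common choice against Cassandra 2.2), a minimal sketch of switching from the default token-aware policy to a plain DC-aware round robin policy could look like the following; the contact point, local DC name, and keyspace below are placeholders, not values from your setup:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;

    public class ClusterFactory {
        public static Session connect() {
            // The driver's default is TokenAwarePolicy wrapping DCAwareRoundRobinPolicy,
            // which sends each request to a replica of the partition being read/written.
            // Using plain DCAwareRoundRobinPolicy spreads coordination across all
            // local-DC nodes instead, which helps confirm or rule out a hotspot.
            Cluster cluster = Cluster.builder()
                    .addContactPoint("10.0.0.1")          // placeholder contact point
                    .withLoadBalancingPolicy(
                            DCAwareRoundRobinPolicy.builder()
                                    .withLocalDc("DC1")   // placeholder local DC name
                                    .build())
                    .build();
            return cluster.connect("my_keyspace");        // placeholder keyspace
        }
    }

If the CPU drops stop after this change, the two affected nodes were likely the replicas of a hot partition, and the longer-term fix belongs in the data model rather than the driver configuration.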


It does look like a data model problem (hot partitions causing issues on specific replicas).

But in any case, you might want to add the following to your cassandra-env.sh to see if it helps:

JVM_OPTS="$JVM_OPTS -Dcassandra.max_queued_native_transport_requests=1024"

More information about this here: https://issues.apache.org/jira/browse/CASSANDRA-11363