We recently deployed microservices to production, and these microservices communicate with Cassandra nodes for reads and writes.
After the deployment, we started noticing CPU usage suddenly dropping to 0 on all Cassandra nodes in the primary DC. This happens at least once per day. Each time it happens, two seemingly random nodes (in the SAME DC) cannot reach each other according to "nodetool describecluster", and "nodetool tpstats" shows that these two nodes have a high number of ACTIVE Native-Transport-Requests, between 100 and 200. These two nodes are also storing HINTS for each other, but when I run long "ping" tests between them I don't see any packet loss. Restarting those two Cassandra nodes fixes the issue for the time being. This has been happening for two weeks.
We use Apache Cassandra 2.2.8.
Also, the microservices' logs show read/write timeouts before the sudden CPU drop on all Cassandra nodes.
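
To catch the problem while it is happening, the ACTIVE Native-Transport-Requests count could be pulled out of the "nodetool tpstats" output with something like the minimal sketch below. It assumes the usual column layout (pool name first, then Active, Pending, and so on). The function name is hypothetical, and the sample text uses made-up illustrative numbers, not output captured from our cluster.

```python
def active_native_transport_requests(tpstats_output):
    """Return the Active count for the Native-Transport-Requests pool.

    Assumes each pool line starts with the pool name followed by the
    Active column. Returns None if the pool is not listed.
    """
    for line in tpstats_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "Native-Transport-Requests":
            return int(fields[1])
    return None


# Illustrative sample only (invented numbers, assumed column layout).
SAMPLE = """\
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                     0         0        1234567         0                 0
ReadStage                         0         0         765432         0                 0
Native-Transport-Requests       150         0        9876543         0                12
"""

if __name__ == "__main__":
    print(active_native_transport_requests(SAMPLE))
```

In practice the text would come from running "nodetool tpstats" on each node at a regular interval (for example via cron), so the count can be lined up against the time of the CPU drop and the client timeouts.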