I have a Cassandra (2.2.1) cluster of 4 nodes which is used by a Java client application. The replication factor is 3, and the consistency level is LOCAL_QUORUM for both reads and writes. Each node holds around 5 GB of data. The request rate is approximately 2-4k per second. There are almost no delete operations, so only a small number of tombstones is created.
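For context, this is roughly how the client sets up the connection and the default consistency level (a minimal sketch with the DataStax Java driver 2.x; the contact point and keyspace name are placeholders, not our real ones):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.Session;

public class ClientSetup {
    public static void main(String[] args) {
        // LOCAL_QUORUM is set as the default consistency level for all statements.
        Cluster cluster = Cluster.builder()
                .addContactPoint("1.1.1.1") // placeholder contact point
                .withQueryOptions(new QueryOptions()
                        .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM))
                .build();
        Session session = cluster.connect("my_keyspace"); // placeholder keyspace
        // ... all reads and writes go through this session ...
        cluster.close();
    }
}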
I noticed poor read and write performance some time ago, and it is getting worse over time - the cluster has become really slow. Read timeouts (most frequently) and write timeouts have become very common. Hardware should not be the problem: the servers the cluster is deployed on are really good in terms of disk performance, CPU, and RAM.
The cause of the issue is unclear to me, but I have noticed several log entries which may point to the root cause:
Exception stack trace in the Java client application log:
com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency LOCAL_QUORUM (2 responses were required but only 1 replica responded)
It is interesting that one replica still responds in time - with RF = 3, LOCAL_QUORUM requires 2 of the 3 replicas to answer, and only one does.
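If it helps with diagnosis, the client can report how many replicas acknowledged before the timeout (a sketch using the driver's ReadTimeoutException accessors; readWithDiagnostics and the CQL string are placeholder names of my own):

import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.exceptions.ReadTimeoutException;

public class TimeoutDiagnostics {
    // Logs how many replicas acknowledged before the read timed out.
    static ResultSet readWithDiagnostics(Session session, String cql) {
        try {
            return session.execute(cql);
        } catch (ReadTimeoutException e) {
            // Prints e.g. "read timed out: 2 required, 1 received, data retrieved: false"
            System.err.printf("read timed out: %d required, %d received, data retrieved: %s%n",
                    e.getRequiredAcknowledgements(),
                    e.getReceivedAcknowledgements(),
                    e.wasDataRetrieved());
            throw e;
        }
    }
}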
Several failed hint replay errors:
Failed replaying hints to /1.1.1.1; aborting (135922 delivered), error : Operation timed out - received only 0 responses.
Several exceptions like the following in the Cassandra logs:
Unexpected exception during request; channel = [id: 0x10fc77df, /2.2.2.2:54459 :> /1.1.1.1:9042]
java.io.IOException: Error while read(...): Connection timed out
    at io.netty.channel.epoll.Native.readAddress(Native Method) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
    at io.netty.channel.epoll.EpollSocketChannel$EpollSocketUnsafe.doReadBytes(EpollSocketChannel.java:675) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
    at io.netty.channel.epoll.EpollSocketChannel$EpollSocketUnsafe.epollInReady(EpollSocketChannel.java:714) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:326) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:264) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_66]
Failed batch errors:
Batch of prepared statements for [<...>] is of size 3453794, exceeding specified threshold of 1024000 by 2429794. (see batch_size_fail_threshold_in_kb)
It looks like the batch is too large; we do have lots of batch operations, by the way. Could the batches be affecting the cluster?
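This is a sketch of what I could try instead - issuing the inserts individually and asynchronously rather than in one multi-megabyte batch, since a large batch puts the whole load on a single coordinator (writeAll and the statement list are placeholder names; this assumes the batched statements target unrelated partitions and do not need batch atomicity):

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import java.util.ArrayList;
import java.util.List;

public class AsyncWrites {
    // Replaces one oversized batch with individual async inserts.
    static void writeAll(Session session, List<BoundStatement> inserts) {
        List<ResultSetFuture> futures = new ArrayList<>();
        for (BoundStatement stmt : inserts) {
            futures.add(session.executeAsync(stmt));
        }
        // Wait for all writes; any write timeout surfaces here.
        for (ResultSetFuture f : futures) {
            f.getUninterruptibly();
        }
    }
}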
Finally, the exception I see most often - these entries appear one after another after switching the logging level to DEBUG:
TIOStreamTransport.java:112 - Error closing output stream.
java.net.SocketException: Socket closed
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:116) ~[na:1.8.0_66]
    at java.net.SocketOutputStream.write(SocketOutputStream.java:153) ~[na:1.8.0_66]
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) ~[na:1.8.0_66]
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) ~[na:1.8.0_66]
    at java.io.FilterOutputStream.close(FilterOutputStream.java:158) ~[na:1.8.0_66]
    at org.apache.thrift.transport.TIOStreamTransport.close(TIOStreamTransport.java:110) ~[libthrift-0.9.2.jar:0.9.2]
    at org.apache.cassandra.thrift.TCustomSocket.close(TCustomSocket.java:197) [apache-cassandra-2.2.1.jar:2.2.1]
    at org.apache.thrift.transport.TFramedTransport.close(TFramedTransport.java:89) [libthrift-0.9.2.jar:0.9.2]
    at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:209) [apache-cassandra-2.2.1.jar:2.2.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_66]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_66]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_66]
Do you have any ideas about what could be causing this problem?
Thank you!