41
votes

While doing a bulk load of data, incrementing counters based on log data, I am encountering a timeout exception. Im using the Datastax 2.0-rc2 java driver.

Is this an issue with the server not being able to keep up (ie server side config issue), or is this an issue with the client getting bored waiting for the server to respond? Either way, is there an easy config change I can make that would fix this?

Exception in thread "main" com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency ONE (1 replica were required but only 0 acknowledged the write)
    at com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:54)
    at com.datastax.driver.core.ResultSetFuture.extractCauseFromExecutionException(ResultSetFuture.java:271)
    at com.datastax.driver.core.ResultSetFuture.getUninterruptibly(ResultSetFuture.java:187)
    at com.datastax.driver.core.Session.execute(Session.java:126)
    at jason.Stats.analyseLogMessages(Stats.java:91)
    at jason.Stats.main(Stats.java:48)
Caused by: com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency ONE (1 replica were required but only 0 acknowledged the write)
    at com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:54)
    at com.datastax.driver.core.Responses$Error.asException(Responses.java:92)
    at com.datastax.driver.core.ResultSetFuture$ResponseCallback.onSet(ResultSetFuture.java:122)
    at com.datastax.driver.core.RequestHandler.setFinalResult(RequestHandler.java:224)
    at com.datastax.driver.core.RequestHandler.onSet(RequestHandler.java:373)
    at com.datastax.driver.core.Connection$Dispatcher.messageReceived(Connection.java:510)
    at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
    at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
    at org.jboss.netty.handler.codec.oneone.OneToOneDecoder.handleUpstream(OneToOneDecoder.java:70)
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
    at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
    at org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
    at org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
    at org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
    at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
    at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
    at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
    at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:109)
    at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90)
    at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
    at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Caused by: com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency ONE (1 replica were required but only 0 acknowledged the write)
    at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:53)
    at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:33)
    at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:165)
    at org.jboss.netty.handler.codec.oneone.OneToOneDecoder.handleUpstream(OneToOneDecoder.java:66)
    ... 21 more

One of the nodes reports this at roughly the time it occured:

ERROR [Native-Transport-Requests:12539] 2014-02-16 23:37:22,191 ErrorMessage.java (line 222) Unexpected exception during request
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(Unknown Source)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
    at sun.nio.ch.IOUtil.read(Unknown Source)
    at sun.nio.ch.SocketChannelImpl.read(Unknown Source)
    at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64)
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:109)
    at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90)
    at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
4

4 Answers

43
votes

While I don't understand the root cause of this issue, I was able to solve the problem by increasing the timeout value in the conf/cassandra.yaml file.

write_request_timeout_in_ms: 20000
30
votes

We experienced similar problems on a single node in an ESX cluster with SAN storage attached (which is not recommended by datastax, but we have no other options at this moment).

Note: the settings below can be a big blow to the maximum performance Cassandra can achieve, but we chose a stable system over high performance.

While running iostat -xmt 1 we found high w_await times at the same time the WriteTimeoutExceptions occured. It turned out the memtable could not be written to disk within the default write_request_timeout_in_ms: 2000 setting.

We significantly reduced the memtable size from 512Mb (defaults to 25% of heap space, which was 2Gb in our case) to 32Mb:

# Total permitted memory to use for memtables. Cassandra will stop
# accepting writes when the limit is exceeded until a flush completes,
# and will trigger a flush based on memtable_cleanup_threshold
# If omitted, Cassandra will set both to 1/4 the size of the heap.
# memtable_heap_space_in_mb: 2048
memtable_offheap_space_in_mb: 32

We also slightly increated the write timeout to 3 seconds:

write_request_timeout_in_ms: 3000

Also make sure you write regularly to disk if you have high IO wait times:

#commitlog_sync: batch
#commitlog_sync_batch_window_in_ms: 2
#
# the other option is "periodic" where writes may be acked immediately
# and the CommitLog is simply synced every commitlog_sync_period_in_ms
# milliseconds.
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000

These settings allowed the memtable to remain small and be written often. The exceptions were solved and we survived the stress tests that were run on the sytem.

2
votes

It is coordinator (so the server) timing out waiting for acknowledgements for the write.

1
votes

Its worth double checking your GC settings for Cassandra.

In my case I was using a semaphore to throttle async writes and still (sometimes) getting timeouts.

It transpired that I was using unsuitable GC settings, I'd been using cassandra-unit for convenience which had the unintended consequence of running with the default VM settings. Consequently we would eventually trigger hit a stop-the-world GC resulting in a write timeout. Applying the same GC settings as my running cassandra docker image and all is fine.

This might be an uncommon cause but it would have helped me so it seems worth recording here.