
I was getting ReadTimeoutException quite frequently in my production Cassandra cluster (10 nodes). To reproduce the issue in my local dev environment (a Cassandra cluster of four nodes), I ran my code and then stopped two CassandraDaemon instances. I got the following exception:

Exception in thread "main" com.datastax.driver.core.exceptions.UnavailableException: Not enough replica available for query at consistency ONE (1 required but only 0 alive)
    at com.datastax.driver.core.exceptions.UnavailableException.copy(UnavailableException.java:79)
    at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:269)
    at com.datastax.driver.core.ArrayBackedResultSet$MultiPage.prepareNextRow(ArrayBackedResultSet.java:285)
    at com.datastax.driver.core.ArrayBackedResultSet$MultiPage.isExhausted(ArrayBackedResultSet.java:245)
    at com.datastax.driver.core.ArrayBackedResultSet$1.hasNext(ArrayBackedResultSet.java:126)
    at com.cleartrail.keyspacedatamigrator.migrator.Migrator.migrateTimeline(Migrator.java:376)
    at com.cleartrail.keyspacedatamigrator.migrator.Migrator.migrateData(Migrator.java:267)
    at TestMigration.main(TestMigration.java:9)
Caused by: com.datastax.driver.core.exceptions.UnavailableException: Not enough replica available for query at consistency ONE (1 required but only 0 alive)
    at com.datastax.driver.core.exceptions.UnavailableException.copy(UnavailableException.java:79)
    at com.datastax.driver.core.Responses$Error.asException(Responses.java:94)
    at com.datastax.driver.core.ArrayBackedResultSet$MultiPage$1.onSet(ArrayBackedResultSet.java:352)
    at com.datastax.driver.core.RequestHandler.setFinalResult(RequestHandler.java:183)
    at com.datastax.driver.core.RequestHandler.access$2300(RequestHandler.java:45)
    at com.datastax.driver.core.RequestHandler$SpeculativeExecution.setFinalResult(RequestHandler.java:748)
    at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onSet(RequestHandler.java:587)
    at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:991)
    at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:913)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
    at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:254)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:242)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:847)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    at java.lang.Thread.run(Thread.java:745)
Caused by: com.datastax.driver.core.exceptions.UnavailableException: Not enough replica available for query at consistency ONE (1 required but only 0 alive)
    at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:48)
    at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:37)
    at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:213)
    at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:204)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:89)
    ... 13 more

My Cassandra connection code looks like this:

SocketOptions so = new SocketOptions();
so.setReadTimeoutMillis(Integer.MAX_VALUE);
so.setConnectTimeoutMillis(sockettimeoutinmillis);

Builder builder = new Cluster.Builder()
        .addContactPoints(connectionpoints)
        .withPort(port)
        .withSocketOptions(so);

PoolingOptions poolingOptions = new PoolingOptions();
poolingOptions.setCoreConnectionsPerHost(HostDistance.LOCAL,
        poolingOptions.getMaxConnectionsPerHost(HostDistance.LOCAL));
builder.withPoolingOptions(poolingOptions);

cluster = builder
        .withRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE)
        .withReconnectionPolicy(new ConstantReconnectionPolicy(10000L))
        .build();

session = cluster.connect();

I have provided a retry policy while connecting to Cassandra, so why am I getting this exception? I haven't written any specific code to handle ReadTimeoutException and retry. Is any specific code or handling required?


1 Answer


If you want custom logic for handling read/write timeouts and UnavailableExceptions, you can implement your own custom RetryPolicy.

You can override the onReadTimeout method to behave exactly as you desire.
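As a sketch of what that could look like (written against the driver 2.x RetryPolicy interface; driver 3.x additionally requires init()/close()/onRequestError(), and the retry cap here is a hypothetical value you would tune):

```java
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.WriteType;
import com.datastax.driver.core.policies.RetryPolicy;

public class LimitedReadRetryPolicy implements RetryPolicy {

    // Hypothetical cap on retries per statement; tune for your workload.
    private static final int MAX_RETRIES = 2;

    @Override
    public RetryDecision onReadTimeout(Statement statement, ConsistencyLevel cl,
                                       int requiredResponses, int receivedResponses,
                                       boolean dataRetrieved, int nbRetry) {
        // Retry read timeouts a bounded number of times at the same consistency
        // level, but only when enough replicas replied and we just missed the data.
        if (nbRetry < MAX_RETRIES && receivedResponses >= requiredResponses && !dataRetrieved) {
            return RetryDecision.retry(cl);
        }
        return RetryDecision.rethrow();
    }

    @Override
    public RetryDecision onWriteTimeout(Statement statement, ConsistencyLevel cl,
                                        WriteType writeType, int requiredAcks,
                                        int receivedAcks, int nbRetry) {
        // Blindly retrying writes risks duplicates for non-idempotent statements.
        return RetryDecision.rethrow();
    }

    @Override
    public RetryDecision onUnavailable(Statement statement, ConsistencyLevel cl,
                                       int requiredReplica, int aliveReplica, int nbRetry) {
        // An immediate retry will see the same dead replicas; surface the error.
        return RetryDecision.rethrow();
    }
}
```

You would then plug it in with `builder.withRetryPolicy(new LimitedReadRetryPolicy())` in place of DowngradingConsistencyRetryPolicy.INSTANCE.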

The exception you provided, however, is an UnavailableException: the Cassandra coordinator is telling you that not enough replicas are available to complete your query (in this case, all replicas owning the data you are trying to read are marked DOWN in C*), so it failed fast without even attempting the read. Retrying is unlikely to help here, since you will probably get the same result. Given the RetryPolicy you specified (DowngradingConsistencyRetryPolicy), what likely happened is that a ReadTimeout or UnavailableException was encountered, the policy retried at a lower ConsistencyLevel (ONE), and the retry failed with another UnavailableException.
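Since an immediate retry won't help for an UnavailableException, one option is to catch it and fail fast (or back off) at the application level. A sketch, where `session` is the one from your connection code and the table name is hypothetical; getRequiredReplicas()/getAliveReplicas() are the accessors the driver exposes on that exception:

```java
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.exceptions.UnavailableException;

public class FailFastRead {

    // Hypothetical read; 'session' comes from cluster.connect().
    static void readTimeline(Session session) {
        try {
            session.execute(new SimpleStatement("SELECT * FROM mykeyspace.timeline"));
        } catch (UnavailableException e) {
            // Every replica for the requested token range is DOWN; retrying
            // against the same cluster state will fail identically.
            System.err.printf("Replicas unavailable: required=%d, alive=%d%n",
                    e.getRequiredReplicas(), e.getAliveReplicas());
            // Surface the error (or schedule a delayed retry once nodes recover).
            throw e;
        }
    }
}
```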

A few questions for you which may help gain clarity:

  1. What replication factor are you using for the keyspace you are working with? You mention four nodes in your dev environment: if one node goes down and you have an RF of 1, roughly 25% of your queries will hit an UnavailableException (the token ranges owned solely by the down node); with two nodes down, as in your test, that rises to about half.
  2. What consistency level are you querying with? If you aren't setting one, you are using ONE by default anyway, so DowngradingConsistencyRetryPolicy probably isn't benefiting you: there is no lower level to downgrade to.
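To address both points, you would raise the keyspace's replication factor and set an explicit consistency level per statement. A sketch with hypothetical keyspace/table names (after changing RF, run `nodetool repair` so existing data gets replicated):

```java
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class ReplicationAndConsistency {

    static void configureAndQuery(Session session) {
        // RF of 3 on a 4-node cluster lets reads survive node failures
        // (SimpleStrategy shown; use NetworkTopologyStrategy for multi-DC).
        session.execute("ALTER KEYSPACE mykeyspace WITH replication = "
                + "{'class': 'SimpleStrategy', 'replication_factor': 3}");

        // Query with an explicit consistency level instead of the default ONE,
        // giving DowngradingConsistencyRetryPolicy room to actually downgrade.
        Statement stmt = new SimpleStatement("SELECT * FROM mykeyspace.timeline")
                .setConsistencyLevel(ConsistencyLevel.QUORUM);
        session.execute(stmt);
    }
}
```

With RF 3 and QUORUM reads, any single node can go down without making data unavailable, and the retry policy can fall back to ONE if a second node fails.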