0 votes

I am having a weird error that started happening a few weeks ago. We had to replace several analytics nodes, and since then none of the Hadoop jobs invoked by Hive are able to finish. They crash at different stages with a similar error:

com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: ip-x-x-x-x.ec2.internal/x.x.x.x:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout during read))
    at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:65)
    at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:256)
    at com.datastax.driver.core.ArrayBackedResultSet$MultiPage.prepareNextRow(ArrayBackedResultSet.java:259)
    at com.datastax.driver.core.ArrayBackedResultSet$MultiPage.isExhausted(ArrayBackedResultSet.java:222)
    at com.datastax.driver.core.ArrayBackedResultSet$1.hasNext(ArrayBackedResultSet.java:115)
    at org.apache.cassandra.hadoop.cql3.CqlRecordReader$RowIterator.computeNext(CqlRecordReader.java:239)
    at org.apache.cassandra.hadoop.cql3.CqlRecordReader$RowIterator.computeNext(CqlRecordReader.java:218)
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
    at org.apache.cassandra.hadoop.cql3.CqlRecordReader.getProgress(CqlRecordReader.java:152)
    at org.apache.hadoop.hive.cassandra.cql3.input.CqlHiveRecordReader.getProgress(CqlHiveRecordReader.java:62)
    at org.apache.hadoop.hive.ql.io.HiveRecordReader.getProgress(HiveRecordReader.java:71)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.getProgress(MapTask.java:260)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:233)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:216)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.Child.main(Child.java:260)
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: ip-x-x-x-x.ec2.internal/x.x.x.x:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout during read))
    at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:103)
    at com.datastax.driver.core.RequestHandler$1.run(RequestHandler.java:175)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

I turned on debug logging, but still could not find anything relevant happening around that time.

Thanks!
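For context, the "Timeout during read" in the trace is the driver's client-side read timeout (12000 ms by default in the 2.x Java driver line shown). A minimal standalone sketch of that knob, assuming driver code outside of Hive; the Hadoop CqlRecordReader builds its own Cluster from the job configuration, so this only illustrates where the timeout comes from:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.SocketOptions;

    public class ReadTimeoutSketch {
        public static void main(String[] args) {
            // Placeholder contact point, matching the masked host in the trace.
            Cluster cluster = Cluster.builder()
                    .addContactPoint("ip-x-x-x-x.ec2.internal")
                    // Raise the per-request read timeout (driver default is 12000 ms),
                    // which is the timeout that surfaces as "Timeout during read".
                    .withSocketOptions(new SocketOptions().setReadTimeoutMillis(60000))
                    .build();
            try {
                cluster.connect();
            } finally {
                cluster.close();
            }
        }
    }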

Could you expand on how you replaced the nodes? What happened to them? Specifically, did you decommission them? – mildewey
Also, can you access data on them using basic CQL queries? – mildewey
Finally, can you verify that the DSE versions of each node match? – mildewey
My apologies for not returning to this question. The company I worked for went out of business in November, but just before that happened I figured out the issue; I did not update it here because of everything going on with the company. It turned out that the application had written a huge amount of data into a map column, causing timeouts on reads. Since there were no descriptive errors in the log, it was hard to narrow the problem down to that cause. – Roman Tumaykin
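A minimal sketch of the basic checks suggested in the comments above, using the same 2.x Java driver as in the stack trace (the hostname is a placeholder; running nodetool version or nodetool status on each node covers the version comparison):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class NodeSanityCheck {
        public static void main(String[] args) {
            // Point this at each replaced node in turn.
            Cluster cluster = Cluster.builder()
                    .addContactPoint("ip-x-x-x-x.ec2.internal")
                    .build();
            try {
                Session session = cluster.connect();
                // "Basic CQL query" check: can the node serve a trivial local read?
                Row local = session.execute("SELECT release_version FROM system.local").one();
                System.out.println("release_version: " + local.getString("release_version"));
            } finally {
                cluster.close();
            }
        }
    }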

1 Answer

0 votes

Actually, the problem was a huge amount of data written by the application into one of the map columns. It just happened to coincide with the cluster update. The Hive jobs were simply hanging, and the error message was misleading. By some trial and error I was able to narrow the issue down to that map column and delete the offending data.
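For anyone hitting the same symptom, a rough sketch of how one might confirm and clean up an oversized map column with the same Java driver; the keyspace, table, column, and key names below are made up for illustration, not the actual schema from this cluster:

    import java.util.Map;

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class MapColumnCleanup {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder()
                    .addContactPoint("ip-x-x-x-x.ec2.internal")
                    .build();
            try {
                // "analytics", "events", "attributes", and "suspect-id" are hypothetical names.
                Session session = cluster.connect("analytics");
                Row row = session.execute(
                        "SELECT id, attributes FROM events WHERE id = ?", "suspect-id").one();
                if (row != null) {
                    Map<String, String> attrs = row.getMap("attributes", String.class, String.class);
                    System.out.println("map entries for suspect row: " + attrs.size());
                    // One possible cleanup: drop the oversized collection for that row.
                    session.execute("DELETE attributes FROM events WHERE id = ?", "suspect-id");
                }
            } finally {
                cluster.close();
            }
        }
    }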