8
votes

I'm running a Spark job with Spark 1.4 and Cassandra 2.18. I can telnet from the master to the Cassandra machine without problems. Sometimes the job runs fine, and sometimes I get the following exception. Why would this happen only sometimes?

"Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 7, 172.28.0.162): java.io.IOException: Failed to open native connection to Cassandra at {172.28.0.164}:9042 at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:155) "

Sometimes I also get this exception along with the one above:

Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /172.28.0.164:9042 (com.datastax.driver.core.TransportException: [/172.28.0.164:9042] Connection has been closed))

1
Have you seen this question? - zapstar
Yes, I have. The problem is that I only get the error sometimes, and at other times my code runs fine. When I restart the master and all the slaves it works, but after running my job 2-3 times it gives me this error again. I closed all the ports stuck in TIME_WAIT but still see the issue. - Nipun

1 Answer

3
votes

I hit the second error, NoHostAvailableException, quite a few times this week while porting Python Spark code to Java Spark.

My problem was that the driver was nearly out of memory and the GC was eating almost all of my CPU (98% across all 8 cores), pausing the JVM constantly.

In Python it's much more obvious (to me) when this happens, so it took me a while to realize what was going on, and I got this error quite a few times in the meantime.

I had two theories about the root cause, but either way the fix was to stop the GC from going crazy.

  1. First theory: the JVM was pausing so often that the driver simply couldn't complete a connection to Cassandra.
  2. Second theory: Cassandra was running on the same machine as Spark, and the JVM was taking 100% of the CPU, so Cassandra couldn't answer in time and it looked to the driver as if there were no Cassandra hosts available.
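
If you want to check whether GC pressure is a factor in your case, here is a rough sketch (not what I actually ran) that uses the standard java.lang.management beans to report how much of the JVM's uptime has gone to garbage collection. The class name GcOverhead and its method names are made up for illustration, and the number is only approximate since the JVM reports accumulated collection time per collector. You would call it from inside the driver JVM, for example just before the job touches Cassandra.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: rough estimate of how much time this JVM has spent in GC.
class GcOverhead {

    /** Total milliseconds spent in GC so far, summed over all collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Approximate fraction of JVM uptime spent in GC. */
    static double gcFraction() {
        long uptimeMs = ManagementFactory.getRuntimeMXBean().getUptime();
        return uptimeMs <= 0 ? 0.0 : (double) totalGcMillis() / uptimeMs;
    }

    public static void main(String[] args) {
        System.out.printf("GC time: %d ms (%.1f%% of uptime)%n",
                totalGcMillis(), 100.0 * gcFraction());
    }
}
```

If the reported share is high around the time the connection fails, that points toward the GC explanation rather than a network problem.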

Hope this helps!