I know there are other very similar questions on Stackoverflow but those either didn't get answered or didn't help me out. In contrast to those questions I put much more stack trace and log file information into this question. I hope that helps, although it made the question to become sorta long and ugly. I'm sorry.
Setup
I'm running a 9 node cluster on Amazon EC2 using m3.xlarge
instances with DSE (DataStax Enterprise) version 4.6 installed. For each workload (Cassandra, Search and Analytics) 3 nodes are used. DSE 4.6 bundles Spark 1.1 and Cassandra 2.0.
Issue
The application (Spark/Shark-Shell) gets removed after ~3 minutes even if I do not run any query. Queries on small datasets run successful as long as they finish within ~3 minutes.
I would like to analyze much larger datasets. Therefore I need the application (shell) not to get removed after ~3 minutes.
Error description
On the Spark or Shark shell, after idling ~3 minutes or while executing (long-running) queries, Spark will eventually abort and give the following stack trace:
15/08/25 14:58:09 ERROR cluster.SparkDeploySchedulerBackend: Application has been killed. Reason: Master removed our application: FAILED
org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
FAILED: Execution Error, return code -101 from shark.execution.SparkTask
This is not very helpful (to me), that's why I'm going to show you more log file information.
Error Details / Log Files
Master
From the master.log
I think the interesing parts are
INFO 2015-08-25 09:19:59 org.apache.spark.deploy.master.DseSparkMaster: akka.tcp://[email protected]:46715 got disassociated, removing it.
INFO 2015-08-25 09:19:59 org.apache.spark.deploy.master.DseSparkMaster: akka.tcp://[email protected]:42136 got disassociated, removing it.
and
ERROR 2015-08-25 09:21:01 org.apache.spark.deploy.master.DseSparkMaster: Application Shark::ip-172-31-46-49 with ID app-20150825091745-0007 failed 10 times, removing it
INFO 2015-08-25 09:21:01 org.apache.spark.deploy.master.DseSparkMaster: Removing app app-20150825091745-0007
Why do the worker nodes get disassociated?
In case you need to see it, I attached the master's executor (ID 1) stdout
as well. The executors stderr
is empty. However, I think it shows nothing useful to tackle the issue.
On the Spark Master UI I verified to see all worker nodes to be ALIVE
. The second screenshot shows the application details.
There is one executor spawned on the master instance while executors on the two worker nodes get respawned until the whole application is removed. Is that okay or does it indicate some issue? I think it might be related to the "(it) failed 10 times" error message from above.
Worker logs
Furthermore I can show you logs of the two Spark worker nodes. I removed most of the class path arguments to shorten the logs. Let me know if you need to see it. As each worker node spawns multiple executors I attached links to some (not all) executor stdout
and stderr
dumps. Dumps of the remaining executors look basically the same.
Worker I
Worker II
The executor dumps seem to indicate some issue with permission and/or timeout. But from the dumps I can't figure out any details.
Attempts
As mentioned above, there are some similar questions but none of those got answered or it didn't help me to solve the issue. Anyway, things I tried and verified are:
- Opened port 2552. Nothing changes.
- Increased spark.akka.askTimeout which results in the Spark/Shark app to live longer but eventually it still gets removed.
- Ran the Spark shell locally with
spark.master=local[4]
. On the one hand this allowed me to run queries longer than ~3 minutes successfully, on the other hand it obviously doesn't take advantage of the distributed environment.
Summary
To sum up, one could say that the timeouts and the fact long-running queries are successfully executed in local mode all indicate some misconfiguration. Though I cannot be sure and I don't know how to fix it.
Any help would be very much appreciated.
Edit: Two of the Analytics and two of the Solr nodes were added after the initial setup of the cluster. Just in case that matters.
Edit (2): I was able to work around the issue described above by replacing the Analytics nodes with three freshly installed Analytics nodes. I can now run queries on much larger datasets without the shell being removed. I intend not to put this as an answer to the question as it is still unclear what is wrong with the three original Analytics nodes. However, as it is a cluster for testing purposes, it was okay to simply replace the nodes (after replacing the nodes I performed a nodetool rebuild -- Cassandra
on each of the new nodes to recover their data from the Cassandra datacenter).