9 votes

I have a Spark Streaming job which reads data from Kafka and does some operations on it. I am running the job on a YARN cluster with Spark 1.4.1, which has two nodes with 16 GB of RAM and 16 cores each.

These are the configuration options passed to the spark-submit job:

--master yarn-cluster --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 3
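
For context, the full submit command looks roughly like this (the main class, jar name, and application arguments are placeholders, not the actual job):

spark-submit \
  --master yarn-cluster \
  --num-executors 3 \
  --driver-memory 4g \
  --executor-memory 2g \
  --executor-cores 3 \
  --class com.example.StreamingJob \
  streaming-job.jar <kafka-broker> <topic>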

The job returns this error and exits after running for a short while:

INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 11,
(reason: Max number of executor failures reached)

.....

ERROR scheduler.ReceiverTracker: Deregistered receiver for stream 0:
Stopped by driver

Update:

These log entries were also found:

INFO yarn.YarnAllocator: Received 3 containers from YARN, launching executors on 3 of them.....

INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down.

....

INFO yarn.YarnAllocator: Received 2 containers from YARN, launching executors on 2 of them.

INFO yarn.ExecutorRunnable: Starting Executor Container.....

INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down...

INFO yarn.YarnAllocator: Completed container container_e10_1453801197604_0104_01_000006 (state: COMPLETE, exit status: 1)

INFO yarn.YarnAllocator: Container marked as failed: container_e10_1453801197604_0104_01_000006. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_e10_1453801197604_0104_01_000006
Exit code: 1
Stack trace: ExitCodeException exitCode=1: 
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
    at org.apache.hadoop.util.Shell.run(Shell.java:487)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Container exited with a non-zero exit code 1

What might be the reasons for this? I would appreciate some help.

Thanks

You probably have some other errors/info before this that mention something like "killing executor" or "lost executor". Can you look in the log for these and post those error messages? - Radu Ionescu
@RaduIonescu I have added some logs which looked suspicious to me. Could you have a look? Thanks. - void
To me it seems you are either calling sparkContext.stop() or using too much memory in the driver (e.g. calling collect() on whole RDDs). You could try running it with explicitly more resources, or with a small dataset, to confirm this. - Radu Ionescu
I tried both. Even with a small dataset, it's happening. - void
Are you using YARN log aggregation? Set yarn.log-aggregation-enable to true. - Justin Peel

2 Answers

3 votes

Can you please show your Scala/Java code that reads from Kafka? I suspect you are probably not creating your SparkConf correctly.

Try something like

SparkConf sparkConf = new SparkConf().setAppName("ApplicationName");

Also try running the application in yarn-client mode and share the output.
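
For reference, here is a minimal sketch of how the driver setup could look on Spark 1.4 with the direct Kafka approach; the broker address, topic, and class name below are placeholders, and the actual processing logic is omitted:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

import kafka.serializer.StringDecoder;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class KafkaStreamingSketch {
    public static void main(String[] args) throws Exception {
        // Do not hard-code a master here; spark-submit supplies it on YARN.
        SparkConf sparkConf = new SparkConf().setAppName("ApplicationName");
        JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(10));

        // Placeholder broker list and topic.
        Map<String, String> kafkaParams = new HashMap<String, String>();
        kafkaParams.put("metadata.broker.list", "broker-host:9092");
        Set<String> topics = Collections.singleton("my-topic");

        JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
                jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
                kafkaParams, topics);

        // The job's operations would go here; print() just forces some output per batch.
        messages.print();

        jssc.start();
        jssc.awaitTermination();
    }
}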

-3 votes

I ran into the same issue, and I found one solution: remove sparkContext.stop() at the end of the main function and leave the stop action to the GC.
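
For illustration, a minimal sketch of the change, assuming a streaming context named jssc (the name is a placeholder):

jssc.start();
jssc.awaitTermination();
// jssc.sparkContext().stop();   // removed - leave the shutdown to the JVM/GC instead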

The Spark team has resolved the issue in Spark core; however, the fix has only been merged into the master branch so far. We need to wait until it is included in a new release.

https://issues.apache.org/jira/browse/SPARK-12009