9 votes

I have a Spark Streaming job which reads data from Kafka and does some operations on it. I am running the job on a YARN cluster with Spark 1.4.1, which has two nodes with 16 GB of RAM and 16 cores each.

These are the configuration options passed to the spark-submit job:

--master yarn-cluster --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 3
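
For context, the full submit command looks roughly like this (the main class, jar name, and application arguments are placeholders, not the actual job):

spark-submit \
  --master yarn-cluster \
  --num-executors 3 \
  --driver-memory 4g \
  --executor-memory 2g \
  --executor-cores 3 \
  --class com.example.StreamingJob \
  streaming-job.jar <kafka-broker> <topic>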

The job returns this error and exits after running for a short while:

INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 11,
(reason: Max number of executor failures reached)

.....

ERROR scheduler.ReceiverTracker: Deregistered receiver for stream 0:
Stopped by driver

Update:

These log entries were also found:

INFO yarn.YarnAllocator: Received 3 containers from YARN, launching executors on 3 of them.....

INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down.

....

INFO yarn.YarnAllocator: Received 2 containers from YARN, launching executors on 2 of them.

INFO yarn.ExecutorRunnable: Starting Executor Container.....

INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down...

INFO yarn.YarnAllocator: Completed container container_e10_1453801197604_0104_01_000006 (state: COMPLETE, exit status: 1)

INFO yarn.YarnAllocator: Container marked as failed: container_e10_1453801197604_0104_01_000006. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_e10_1453801197604_0104_01_000006
Exit code: 1
Stack trace: ExitCodeException exitCode=1: 
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
    at org.apache.hadoop.util.Shell.run(Shell.java:487)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Container exited with a non-zero exit code 1

What might be the reasons for this? I would appreciate some help.

Thanks

You probably have some other errors/info before this that mention something like "killing executor" or "lost executor". Can you look in the log for these and post those error messages? - Radu Ionescu
@RaduIonescu I have added some logs which looked suspicious to me. Could you have a look? Thanks. - void
To me it seems you are either calling sparkContext.stop() or using too much memory in the driver (e.g. calling collect() on whole RDDs). You could try running it with explicitly more resources, or with a small dataset, to confirm this. - Radu Ionescu
I tried both. Even with a small dataset, it's happening. - void
Are you using YARN log aggregation? Set yarn.log-aggregation-enable to true. - Justin Peel

2 Answers

3 votes

Can you please show your Scala/Java code that reads from Kafka? I suspect you are probably not creating your SparkConf correctly.

Try something like

SparkConf sparkConf = new SparkConf().setAppName("ApplicationName");

Also try running the application in yarn-client mode and share the output.
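
For reference, here is a minimal sketch of how the driver setup could look on Spark 1.4 with the direct Kafka approach; the broker address, topic, and class name below are placeholders, and the actual processing logic is omitted:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

import kafka.serializer.StringDecoder;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class KafkaStreamingSketch {
    public static void main(String[] args) throws Exception {
        // Do not hard-code a master here; spark-submit supplies it on YARN.
        SparkConf sparkConf = new SparkConf().setAppName("ApplicationName");
        JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(10));

        // Placeholder broker list and topic.
        Map<String, String> kafkaParams = new HashMap<String, String>();
        kafkaParams.put("metadata.broker.list", "broker-host:9092");
        Set<String> topics = Collections.singleton("my-topic");

        JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
                jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
                kafkaParams, topics);

        // The job's operations would go here; print() just forces some output per batch.
        messages.print();

        jssc.start();
        jssc.awaitTermination();
    }
}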

-3 votes

I ran into the same issue, and I found one solution: remove sparkContext.stop() at the end of the main function and leave the stop action to the GC.
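
For illustration, a minimal sketch of the change, assuming a streaming context named jssc (the name is a placeholder):

jssc.start();
jssc.awaitTermination();
// jssc.sparkContext().stop();   // removed - leave the shutdown to the JVM/GC instead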

The Spark team has resolved the issue in Spark core; however, the fix has only been merged into the master branch so far. We need to wait until it is included in a new release.

https://issues.apache.org/jira/browse/SPARK-12009