
EDIT: by removing the 'setMaster' conf setting inside the application I'm now able to run in yarn-cluster mode successfully. If anyone could help with getting the Spark standalone master to work with cluster deploy mode, that would be fantastic.
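
For anyone hitting the same thing, here is a rough sketch of the shape of the fix (this is not my real FileInputRename, just an illustration of building the SparkConf without setMaster so the --master passed to spark-submit wins):

    import org.apache.spark.{SparkConf, SparkContext}

    object FileInputRename {
      def main(args: Array[String]): Unit = {
        // No .setMaster(...) here: hard-coding the master in code overrides the
        // --master / --deploy-mode values passed to spark-submit, which is what
        // broke the cluster submissions for me.
        val conf = new SparkConf().setAppName("FileInputRename")
        val sc = new SparkContext(conf)

        val inputPath = args(0) // e.g. "s3://bucket/jar/fileInputRename.txt"
        // ... the real read-from / write-back-to S3 logic goes here ...

        sc.stop()
      }
    }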

I'm trying to set up Spark on a local test machine so that I can read from an S3 bucket and then write back to it.

Running the jar/application in client mode works fine; well, fine in that it goes off to the bucket, creates a file, and comes back again.
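
For reference, the client-mode run is essentially the same submit command as the cluster one further down, just without --deploy-mode cluster (client is the default), i.e. roughly:

    c:\>spark-submit --jars="C:\Spark\hadoop\share\hadoop\common\lib\hadoop-aws-2.7.1.jar,C:\Spark\hadoop\share\hadoop\common\lib\aws-java-sdk-1.7.4.jar" --verbose --master spark://127.0.0.1:7077 --class FileInputRename c:\sparkSubmit\sparkSubmit_NoJarSetInConf.jar "s3://bucket/jar/fileInputRename.txt"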

However, I need this to work in cluster mode so that it more closely resembles our production environment, yet it constantly fails, with no really useful messages in the logs that I can see and little feedback to go on.

Any help is greatly appreciated. I'm very new to Spark/Hadoop, so I may have overlooked something obvious.

I also tried running with yarn-cluster as the master, but that failed for a different reason (it said it couldn't find the S3 native filesystem classes, which I pass in as jars).

This is on a Windows environment.

The command I'm running:

c:\>spark-submit --jars="C:\Spark\hadoop\share\hadoop\common\lib\hadoop-aws-2.7.1.jar,C:\Spark\hadoop\share\hadoop\common\lib\aws-java-sdk-1.7.4.jar" --verbose --deploy-mode cluster --master spark://127.0.0.1:7077 --class FileInputRename c:\sparkSubmit\sparkSubmit_NoJarSetInConf.jar "s3://bucket/jar/fileInputRename.txt"

The output from this on the console is:

Using properties file: C:\Spark\bin\..\conf\spark-defaults.conf
Parsed arguments:
  master                  spark://127.0.0.1:7077
  deployMode              cluster
  executorMemory          null
  executorCores           null
  totalExecutorCores      null
  propertiesFile          C:\Spark\bin\..\conf\spark-defaults.conf
  driverMemory            null
  driverCores             null
  driverExtraClassPath    null
  driverExtraLibraryPath  null
  driverExtraJavaOptions  null
  supervise               false
  queue                   null
  numExecutors            null
  files                   null
  pyFiles                 null
  archives                null
  mainClass               FileInputRename
  primaryResource         file:/c:/sparkSubmit/sparkSubmit_NoJarSetInConf.jar
  name                    FileInputRename
  childArgs               [s3://SessionCam-Steve/jar/fileInputRename.txt]
  jars                    file:/C:/Spark/hadoop/share/hadoop/common/lib/hadoop-aws-2.7.1.jar,file:/C:/Spark/hadoop/share/hadoop/common/lib/aws-java-sdk-1.7.4.jar
  packages                null
  packagesExclusions      null
  repositories            null
  verbose                 true

Spark properties used, including those specified through
 --conf and those from the properties file C:\Spark\bin\..\conf\spark-defaults.conf:



Running Spark using the REST application submission protocol.
Main class:
org.apache.spark.deploy.rest.RestSubmissionClient
Arguments:
file:/c:/sparkSubmit/sparkSubmit_NoJarSetInConf.jar
FileInputRename
s3://SessionCam-Steve/jar/fileInputRename.txt
System properties:
SPARK_SUBMIT -> true
spark.driver.supervise -> false
spark.app.name -> FileInputRename
spark.jars -> file:/C:/Spark/hadoop/share/hadoop/common/lib/hadoop-aws-2.7.1.jar,file:/C:/Spark/hadoop/share/hadoop/common/lib/aws-java-sdk-1.7.4.jar,file:/c:/sparkSubmit/sparkSubmit_NoJarSetInConf.jar
spark.submit.deployMode -> cluster
spark.master -> spark://127.0.0.1:7077
Classpath elements:



16/03/24 12:01:56 INFO rest.RestSubmissionClient: Submitting a request to launch an application in spark://127.0.0.1:7077.

After a few more seconds it just returns to the C:\ prompt and nothing else. The logs on port 8080 show:

Application ID  Name    Cores   Memory per Node Submitted Time  User    State   Duration
app-20160324120221-0016 FileInputRename 1   1024.0 MB   2016/03/24 12:02:21 Administrator   FINISHED    3 s

where the application's log only shows:

16/03/24 12:02:24 INFO spark.SecurityManager: Changing view acls to: Administrator
16/03/24 12:02:24 INFO spark.SecurityManager: Changing modify acls to: Administrator
16/03/24 12:02:24 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(Administrator); users with modify permissions: Set(Administrator)

If I run with yarn-cluster as the master, so that this is my command:

c:>spark-submit --jars="C:\Spark\hadoop\share\hadoop\common\lib\hadoop-aws-2.7.1.jar,C:\Spark\hadoop\share\hadoop\common\lib\aws-java-sdk-1.7.4.jar" --verbose --master yarn-cluster --class FileInputRename c:\sparkSubmit\sparkSubmit_NoJarSetInConf.jar "s3://SessionCam-Steve/jar/fileInputRename.txt"

The output and exception:

Using properties file: C:\Spark\bin\..\conf\spark-defaults.conf
Parsed arguments:
  master                  yarn-cluster
  deployMode              null
  executorMemory          null
  executorCores           null
  totalExecutorCores      null
  propertiesFile          C:\Spark\bin\..\conf\spark-defaults.conf
  driverMemory            null
  driverCores             null
  driverExtraClassPath    null
  driverExtraLibraryPath  null
  driverExtraJavaOptions  null
  supervise               false
  queue                   null
  numExecutors            null
  files                   null
  pyFiles                 null
  archives                null
  mainClass               FileInputRename
  primaryResource         file:/c:/sparkSubmit/sparkSubmit_NoJarSetInConf.jar
  name                    FileInputRename
  childArgs               [s3://SessionCam-Steve/jar/fileInputRename.txt]
  jars                    file:/C:/Spark/hadoop/share/hadoop/common/lib/hadoop-aws-2.7.1.jar,file:/C:/Spark/hadoop/share/hadoop/common/lib/aws-java-sdk-1.7.4.jar
  packages                null
  packagesExclusions      null
  repositories            null
  verbose                 true

Spark properties used, including those specified through
 --conf and those from the properties file C:\Spark\bin\..\conf\spark-defaults.conf:



Main class:
org.apache.spark.deploy.yarn.Client
Arguments:
--name
FileInputRename
--addJars
file:/C:/Spark/hadoop/share/hadoop/common/lib/hadoop-aws-2.7.1.jar,file:/C:/Spark/hadoop/share/hadoop/common/lib/aws-java-sdk-1.7.4.jar
--jar
file:/c:/sparkSubmit/sparkSubmit_NoJarSetInConf.jar
--class
FileInputRename
--arg
s3://SessionCam-Steve/jar/fileInputRename.txt
System properties:
SPARK_SUBMIT -> true
spark.app.name -> FileInputRename
spark.submit.deployMode -> cluster
spark.master -> yarn-cluster
Classpath elements:



16/03/24 12:05:23 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/03/24 12:05:23 INFO yarn.Client: Requesting a new application from cluster with 1 NodeManagers
16/03/24 12:05:23 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
16/03/24 12:05:23 INFO yarn.Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
16/03/24 12:05:23 INFO yarn.Client: Setting up container launch context for our AM
16/03/24 12:05:23 INFO yarn.Client: Setting up the launch environment for our AM container
16/03/24 12:05:23 INFO yarn.Client: Preparing resources for our AM container
16/03/24 12:05:24 WARN : Your hostname, WIN-EU4MXZ2GSIW resolves to a loopback/non-reachable address: fe80:0:0:0:0:5efe:a94:1d11%14, but we couldn't find any external IP address!
16/03/24 12:05:25 INFO yarn.Client: Uploading resource file:/C:/Spark/lib/spark-assembly-1.6.1-hadoop2.6.0.jar -> hdfs://0.0.0.0:19000/user/Administrator/.sparkStaging/application_1458817514983_0004/spark-assembly-1.6.1-had
16/03/24 12:05:27 INFO yarn.Client: Uploading resource file:/c:/sparkSubmit/sparkSubmit_NoJarSetInConf.jar -> hdfs://0.0.0.0:19000/user/Administrator/.sparkStaging/application_1458817514983_0004/sparkSubmit_NoJarSetInConf.j
16/03/24 12:05:27 INFO yarn.Client: Uploading resource file:/C:/Spark/hadoop/share/hadoop/common/lib/hadoop-aws-2.7.1.jar -> hdfs://0.0.0.0:19000/user/Administrator/.sparkStaging/application_1458817514983_0004/hadoop-aws-2.
16/03/24 12:05:27 INFO yarn.Client: Uploading resource file:/C:/Spark/hadoop/share/hadoop/common/lib/aws-java-sdk-1.7.4.jar -> hdfs://0.0.0.0:19000/user/Administrator/.sparkStaging/application_1458817514983_0004/aws-java-sd
16/03/24 12:05:27 INFO yarn.Client: Uploading resource file:/C:/temp/2/spark-12375b13-dac4-42b8-9ff6-19b0f895c5d1/__spark_conf__7363738392648975127.zip -> hdfs://0.0.0.0:19000/user/Administrator/.sparkStaging/application_14
16/03/24 12:05:28 INFO spark.SecurityManager: Changing view acls to: Administrator
16/03/24 12:05:28 INFO spark.SecurityManager: Changing modify acls to: Administrator
16/03/24 12:05:28 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(Administrator); users with modify permissions: Set(Administrator)
16/03/24 12:05:28 INFO yarn.Client: Submitting application 4 to ResourceManager
16/03/24 12:05:29 INFO impl.YarnClientImpl: Submitted application application_1458817514983_0004
16/03/24 12:05:30 INFO yarn.Client: Application report for application_1458817514983_0004 (state: ACCEPTED)
16/03/24 12:05:30 INFO yarn.Client:
         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: N/A
         ApplicationMaster RPC port: -1
         queue: default
         start time: 1458821128787
         final status: UNDEFINED
         tracking URL: http://WIN-EU4MXZ2GSIW:8088/proxy/application_1458817514983_0004/
         user: Administrator
16/03/24 12:05:31 INFO yarn.Client: Application report for application_1458817514983_0004 (state: ACCEPTED)
[... the same ACCEPTED application report repeats roughly once per second ...]
16/03/24 12:06:09 INFO yarn.Client: Application report for application_1458817514983_0004 (state: ACCEPTED)
16/03/24 12:06:10 INFO yarn.Client: Application report for application_1458817514983_0004 (state: FAILED)
16/03/24 12:06:10 INFO yarn.Client:
         client token: N/A
         diagnostics: Application application_1458817514983_0004 failed 2 times due to AM Container for appattempt_1458817514983_0004_000002 exited with  exitCode: 15
For more detailed output, check application tracking page:http://WIN-EU4MXZ2GSIW:8088/cluster/app/application_1458817514983_0004Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1458817514983_0004_02_000001
Exit code: 15
Stack trace: ExitCodeException exitCode=15:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
        at org.apache.hadoop.util.Shell.run(Shell.java:456)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Shell output:         1 file(s) moved.


Container exited with a non-zero exit code 15
Failing this attempt. Failing the application.
         ApplicationMaster host: N/A
         ApplicationMaster RPC port: -1
         queue: default
         start time: 1458821128787
         final status: FAILED
         tracking URL: http://WIN-EU4MXZ2GSIW:8088/cluster/app/application_1458817514983_0004
         user: Administrator
Exception in thread "main" org.apache.spark.SparkException: Application application_1458817514983_0004 finished with failed status
        at org.apache.spark.deploy.yarn.Client.run(Client.scala:1034)
        at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1081)
        at org.apache.spark.deploy.yarn.Client.main(Client.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16/03/24 12:06:10 INFO util.ShutdownHookManager: Shutdown hook called
16/03/24 12:06:10 INFO util.ShutdownHookManager: Deleting directory C:\temp\2\spark-12375b13-dac4-42b8-9ff6-19b0f895c5d1

This creates two application IDs in the GUI:

Application ID  Name    Cores   Memory per Node Submitted Time  User    State   Duration
app-20160324120600-0018 FileInputRename 2   1024.0 MB   2016/03/24 12:06:00 Administrator   FINISHED    9 s
app-20160324120543-0017 FileInputRename 2   1024.0 MB   2016/03/24 12:05:43 Administrator   FINISHED    8 s

both of which have this as the exception:

16/03/24 12:05:49 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2578)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
    at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:84)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1193)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1185)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1980)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2072)
    ... 16 more

It would be fantastic (and a huge relief) if I could get either of these working. Thank you in advance for any help.


2 Answers

0 votes

Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found

You have a problem with the classpath. In cluster mode the "application" is executed on one of the nodes, which probably has its own classpath, so:

Either hadoop-aws-2.7.1.jar is for some reason not present at all (despite the fact that you provide it with --jars; check whether it is present on all workers at the provided path),

or there is another hadoop-aws jar with a different version on the classpath (personally I think it is the second variant). Try removing the --jars=... argument and see if that helps.
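
If the jars really are present at the same path on every node, another option is to put them on the driver and executor classpaths explicitly in conf\spark-defaults.conf. Just a sketch, assuming the Windows paths from your question exist on each worker (note that ; is the classpath separator on Windows):

    spark.driver.extraClassPath   C:\Spark\hadoop\share\hadoop\common\lib\hadoop-aws-2.7.1.jar;C:\Spark\hadoop\share\hadoop\common\lib\aws-java-sdk-1.7.4.jar
    spark.executor.extraClassPath C:\Spark\hadoop\share\hadoop\common\lib\hadoop-aws-2.7.1.jar;C:\Spark\hadoop\share\hadoop\common\lib\aws-java-sdk-1.7.4.jar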

0 votes

If you already have your YARN cluster installed and configured, you just need to set the HADOOP_CONF_DIR environment variable to point to your client-side Hadoop configuration. You don't need to specify those AWS jars, since they are already provided by your Hadoop cluster; if you provide them again they can conflict with the ones already there. See the Spark documentation on submitting applications for reference. I would also suggest using s3n as the protocol to read from and write to S3.
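
A minimal sketch of what that could look like on Windows, assuming your client Hadoop configuration (core-site.xml, yarn-site.xml) lives under C:\Spark\hadoop\etc\hadoop; adjust the path to your installation:

    c:\>set HADOOP_CONF_DIR=C:\Spark\hadoop\etc\hadoop
    c:\>spark-submit --verbose --master yarn-cluster --class FileInputRename c:\sparkSubmit\sparkSubmit_NoJarSetInConf.jar "s3n://SessionCam-Steve/jar/fileInputRename.txt"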