3
votes

Spark submit in YARN cluster mode failing but it's successful in client mode

Spark submit:

spark-submit \
--master yarn --deploy-mode cluster \
--py-files packages.zip,deps2.zip \
--files /home/sshsanjeev/git/pyspark-example-demo/configs/etl_config.json \
jobs/etl_job.py
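
For context: in cluster mode, --files localizes etl_config.json into each YARN container's working directory under its basename only, so code inside the job has to look it up there (or via SparkFiles) rather than under the submit-side configs/ path. A minimal sketch of that lookup, assuming the file is shipped exactly as in the command above (the helper name load_config is mine, not from the project):

import json
import os
from pyspark import SparkFiles

def load_config(filename='etl_config.json'):
    # Files shipped with --files / SparkContext.addFile() end up in the
    # directory reported by SparkFiles.getRootDirectory().
    candidate = os.path.join(SparkFiles.getRootDirectory(), filename)
    if not os.path.isfile(candidate):
        # Fall back to the container's working directory, where YARN
        # localizes --files entries by basename.
        candidate = filename
    with open(candidate) as f:
        return json.load(f)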

Error stack:

Traceback (most recent call last):
  File "etl_job.py", line 51, in <module>
    main()
  File "etl_job.py", line 11, in main
    app_name='my_etl_job',spark_config={'spark.sql.shuffle.partitions':2})
  File "/mnt/resource/hadoop/yarn/local/usercache/sshsanjeev/appcache/application_1555349704365_0218/container_1555349704365_0218_01_000001/packages.zip/dependencies/spark_conn.py", line 20, in start_spark
  File "/usr/hdp/current/spark2-client/python/pyspark/context.py", line 891, in addFile
    self._jsc.sc().addFile(path, recursive)
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o204.addFile.
: java.io.FileNotFoundException: File file:/mnt/resource/hadoop/yarn/local/usercache/sshsanjeev/appcache/application_1555349704365_0218/container_1555349704365_0218_01_000001/configs/etl_config.json does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:624)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:850)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:614)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:422)
    at org.apache.spark.SparkContext.addFile(SparkContext.scala:1529)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
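
Reading the traceback, spark_conn.py line 20 apparently passes a relative path to SparkContext.addFile(), which the driver resolves against its working directory; in cluster mode that is the YARN container directory, where --files placed only the basename etl_config.json, so configs/etl_config.json does not exist there. A rough reconstruction of the failing helper (the exact code is an assumption inferred from the stack trace):

from pyspark.sql import SparkSession

def start_spark(app_name, spark_config, files=('configs/etl_config.json',)):
    builder = SparkSession.builder.appName(app_name)
    for key, val in spark_config.items():
        builder = builder.config(key, val)
    spark = builder.getOrCreate()
    for path in files:
        # Relative path: resolved against the driver's working directory.
        # Works in client mode (launched from the project root); fails in
        # cluster mode, where the driver runs inside the YARN container.
        spark.sparkContext.addFile(path)
    return spark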

I did several online searches and followed this article: https://community.cloudera.com/t5/Support-Questions/Spark-job-fails-in-cluster-mode/td-p/58772, but the issue is still not resolved.

Please note that I have tried two approaches, placing the config file on the local path of the NameNode as well as in an HDFS directory, but I still get the same error. In client mode the job runs successfully. I need guidance.

Here is the stack version of my HDP cluster:

HDP-2.6.5.3008, YARN 2.7.3, Spark2 2.3.2

Let me know if further info is required. Any suggestions would be highly appreciated.

1
Can you provide error logs (if any) from 'yarn logs -applicationId application_1555349704365_0218'? - skY
@Sanjeev Roy, was this issue resolved? I am facing the same problem :( - Jirilmon

1 Answer

0
votes

It could be related to a permission issue that prevents the directory from being created; without that directory there is no place to put the intermediate results, so the job fails. The directory in question, /mnt/resource/hadoop/yarn/local/usercache/<username>/appcache/<applicationID>, is used to store intermediate results, which then go to HDFS or to memory depending on whether they are written to a path or stored in temp tables. The user might not have permission on it; once the job finishes, it gets flushed out. Granting the user the correct permissions on /mnt/resource/hadoop/yarn/local/usercache on the specific worker node should resolve the issue.
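
A quick way to check this from the worker node that ran the failing container, as a minimal sketch (the path comes from the traceback above; the expected owner depends on whether the cluster is kerberized, so treat it as an assumption):

import os
import pwd

path = '/mnt/resource/hadoop/yarn/local/usercache/sshsanjeev'
st = os.stat(path)
# The submitting user (or the yarn user on a non-kerberized cluster) needs
# rwx here so containers can create their appcache/<applicationID> subdirs.
print('owner:', pwd.getpwuid(st.st_uid).pw_name)
print('mode :', oct(st.st_mode & 0o777))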