I'm trying to change the Spark staging directory to prevent data loss on worker decommissioning (on Google Dataproc with Spark 2.4).
I want to switch from HDFS staging to Google Cloud Storage (GCS) staging.
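For reference, I'm submitting with plain spark-submit on the cluster; as far as I understand, passing the same property through the Dataproc jobs API should be equivalent (cluster and region names below are just placeholders):
gcloud dataproc jobs submit pyspark gs://dataproc-examples-2f10d78d114f6aaec76462e3c310f31f/src/pyspark/hello-world/hello-world.py --cluster=my-cluster --region=europe-west1 --properties=spark.yarn.stagingDir=gs://my-bucket/my-staging/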
When I run this command:
spark-submit --conf "spark.yarn.stagingDir=gs://my-bucket/my-staging/" gs://dataproc-examples-2f10d78d114f6aaec76462e3c310f31f/src/pyspark/hello-world/hello-world.py
I get this error:
org.apache.spark.SparkException: Application application_1560413919313_0056 failed 2 times due to AM Container for appattempt_1560413919313_0056_000002 exited with exitCode: -1000
Failing this attempt.Diagnostics: [2019-06-20 07:58:04.462]File not found : gs:/my-staging/.sparkStaging/application_1560413919313_0056/pyspark.zip java.io.FileNotFoundException: File not found : gs:/my-staging/.sparkStaging/application_1560413919313_0056/pyspark.zip
The Spark job fails, but the .sparkStaging/ directory is created on GCS.
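(The staging path comes from the submit command above; a listing such as the following should show the application's staging directory being created:
gsutil ls -r gs://my-bucket/my-staging/.sparkStaging/ )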
Any idea on this issue?
Thanks.