1 vote

I have a Spark job which is failing due to the following error.

 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 34338.0 failed 4 times, most recent failure: Lost task 0.3 in stage 34338.0 (TID 61601, homeplus-cmp-transient-20190128165855-w-0.c.dh-homeplus-cmp-35920.internal, executor 80): java.io.IOException: Failed to rename FileStatus{path=gs://bucket/models/2018-01-30/model_0002002525030015/metadata/_temporary/0/_temporary/attempt_20190128173835_34338_m_000000_61601/part-00000; isDirectory=false; length=357; replication=3; blocksize=134217728; modification_time=1548697131902; access_time=1548697131902; owner=yarn; group=yarn; permission=rwx------; isSymlink=false} to gs://bucket/models/2018-01-30/model_0002002525030015/metadata/attempt_20190128173835_34338_m_000000_61601/attempt_20190128173835_34338_m_000000_61601/attempt_20190128173835_34338_m_000000_61601/attempt_20190128173835_34338_m_000000_61601/attempt_20190128173835_34338_m_000000_61601/attempt_20190128173835_34338_m_000000_61601/attempt_20190128173835_34338_m_000000_61601/part-00000

I'm unable to figure out which permission is missing. Since the Spark job was able to write the temporary files, I'm assuming write permissions are already in place.
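For reference, the bucket's IAM bindings and the temporary output can be inspected with gsutil (a sketch; the bucket and path below are taken from the error log above):

    # Show which roles are bound to which members on the bucket
    gsutil iam get gs://bucket

    # Confirm the temporary attempt files were actually written
    gsutil ls -r gs://bucket/models/2018-01-30/model_0002002525030015/metadata/_temporary/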

Which Dataproc and GCS connector versions are you using? Could you share the commands you use to create the Dataproc cluster and submit the job? Are you running only one job at a time or multiple in parallel? Do they write to the same folder? How frequent is this failure? - Igor Dvorzhak
I'm using Dataproc image 1.2.37, so whatever version comes pre-installed is being used. I'm using an Airflow operator to create the cluster, and I'm not passing any extra Spark, Hive, or core properties during cluster creation; I just init the Hive metastore using cloud-sql-proxy. I'm running multiple jobs, but they write to different locations within the same bucket, something like gs://bucket/outputfolder/job1 and gs://bucket/outputfolder/job2. I'm seeing this error on every run. - kaysush
Could you execute this command on the master node: sudo sh -c 'echo "\nlog4j.logger.com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase=DEBUG" >> /etc/spark/conf/log4j.properties'? It will enable debug logs for the class that logs the exception occurring during rename. You will be able to find the log messages in StackDriver using the GHFS.rename string. - Igor Dvorzhak
So I figured out that I had only the Storage Legacy Owner role on the bucket. I added the Storage Admin role as well and that seems to have solved the issue. Thanks. - kaysush

1 Answer

1 vote

Per the OP's comment, the issue was in the permissions configuration:

So I figured out that I had only the Storage Legacy Owner role on the bucket. I added the Storage Admin role as well and that seems to have solved the issue. Thanks.
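In case it helps others, the extra role can be granted at the bucket level with gsutil (a sketch; SERVICE_ACCOUNT and PROJECT are placeholders for whichever account the Dataproc VMs run as):

    # Grant Storage Admin on the bucket to the cluster's service account
    gsutil iam ch \
        serviceAccount:SERVICE_ACCOUNT@PROJECT.iam.gserviceaccount.com:roles/storage.admin \
        gs://bucket

This matters because the GCS connector implements rename as a copy of each object followed by a delete of the source, so the output committer needs full object read/write/delete access on the bucket, which the legacy owner role alone apparently did not provide here.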