0 votes

I have ORC data on HDFS (non-partitioned), ~8 billion rows, 250 GB in size. I am reading the data into a DataFrame and writing it back out without any transformations, using partitionBy, e.g.: df.write.mode("overwrite").partitionBy("some_column").orc("hdfs path")

As I monitored the job status in the Spark UI, the job and its stage complete in 20 minutes. But the "SQL" tab in the Spark UI shows 40 minutes.

After running the job in debug mode and going through the Spark logs, I realised the tasks writing to "_temporary" complete in 20 minutes.

After that, the merge of "_temporary" into the actual output path takes another 20 minutes.

So my question is: does the driver process merge the data from "_temporary" into the output path sequentially, or is that done by executor tasks?

Is there anything I can do to improve the performance?

Try using repartition for more parallel execution. - Srinivas
I always wonder how "slow" is quantified. The ORC format needs to be built, so the write cost is higher than dumping a plain old text file. You could repartition, but what is your concurrency now? - thebluephantom
I ran with 20 executors with 8 cores each, so ~160 tasks. My slowness question is more about merging "_temporary" into the final output path. - Chari
If you use repartition, the file merge is mostly skipped because each executor writes one file per partition directory. Say you use repartition(10): 10 files will be written to each partition directory. - Srinivas
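
A minimal sketch of the repartition approach from the comment above, assuming hypothetical source/output paths and an illustrative partition count:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("orc-partitioned-write").getOrCreate()

    // Read the source ORC data (path is a placeholder).
    val df = spark.read.orc("hdfs:///path/to/source")

    // repartition(10) limits the number of output files: each partition
    // directory receives at most 10 files, one per task that holds rows
    // for that partition value.
    df.repartition(10)
      .write
      .mode("overwrite")
      .partitionBy("some_column")
      .orc("hdfs:///path/to/output")

Repartitioning by the partition column instead (df.repartition(col("some_column"))) would yield one file per partition value, at the cost of possible skew if some values hold far more rows than others.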

1 Answer

1 vote

You may want to check the spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version option in your app's config. With version 1, the driver commits the temporary files sequentially, which is a known bottleneck. But frankly, people usually observe this problem only with a much larger number of files than in your case. Depending on your version of Spark, you may be able to set the commit algorithm version to 2; see SPARK-20107 for details.
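
As a hedged illustration of how that option can be set (the app name and submit command are placeholders, not from the answer):

    // Via spark-submit:
    //   spark-submit --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 ...

    // Or when building the session in code:
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("orc-partitioned-write")
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      .getOrCreate()

With algorithm version 2, each task moves its files to the final output location as it commits, so the job-commit phase on the driver has far less sequential renaming to do.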

On a separate note, 8 cores per executor is not recommended, as it can saturate disk I/O when all 8 tasks are writing output at once.
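
For example, the same ~160 total cores could be spread across more, smaller executors (numbers are illustrative, not a tuned recommendation):

    spark-submit --num-executors 40 --executor-cores 4 ...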