3
votes

I have an external Hive table defined with a location in S3:

LOCATION 's3n://bucket/path/'

When writing to this table at the end of a PySpark job that aggregates a bunch of data, the write to Hive is extremely slow because only one executor/container is used for the write. When writing to an HDFS-backed table, the write happens in parallel and is significantly faster.

I've tried defining the table using an s3a path, but then my job fails with some vague errors.

This is on Amazon EMR 5.0 (Hadoop 2.7) with PySpark 2.0, but I have experienced the same issue with previous versions of EMR/Spark.

Is there a configuration or alternative library that I can use to make this write more efficient?
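For reference, the write is essentially of this shape (simplified sketch; the table, column, and key names are placeholders standing in for my actual job):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("aggregate-and-write")
         .enableHiveSupport()
         .getOrCreate())

# Placeholder aggregation standing in for the real job logic.
agg = (spark.table("source_table")
       .groupBy("some_key")
       .agg({"some_value": "sum"}))

# Writing into the S3-backed external Hive table; this is the step
# where only one executor/container appears to do any work.
agg.write.insertInto("my_external_s3_table", overwrite=True)
```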

1 Answer

3
votes

I guess you're using Parquet. The DirectParquetOutputCommitter was removed to avoid potential data loss issues; that change actually went in around April 2016.

That means the data you write to S3 is first saved to a _temporary folder and then "moved" to its final location. Unfortunately, "moving" == "copying & deleting" in S3, and it is rather slow. To make it worse, this final "move" is done only by the driver.

If you don't want to fight to add that class back, you will have to write to local HDFS first and then copy the data over (I do recommend this) -- see the sketch below. In HDFS, "moving" is essentially "renaming", so it takes almost no time.
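A minimal sketch of that workaround on EMR, assuming `s3-dist-cp` is available on the cluster (it ships with stock EMR); the staging path, S3 target, and table names are placeholders:

```python
import subprocess
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

hdfs_staging = "hdfs:///tmp/my_table_staging"   # placeholder HDFS staging path
s3_target = "s3://bucket/path/"                 # placeholder final S3 location

# 1. Write in parallel to HDFS -- fast, because the final commit is a cheap rename.
agg = spark.table("source_table")  # stand-in for the real aggregation
agg.write.mode("overwrite").parquet(hdfs_staging)

# 2. Copy the finished files to S3 with s3-dist-cp, EMR's distributed copy tool,
#    so the expensive S3 upload is spread across the cluster instead of the driver.
subprocess.check_call([
    "s3-dist-cp",
    "--src", hdfs_staging,
    "--dest", s3_target,
])

# 3. If the external table's LOCATION already points at the S3 path, an
#    unpartitioned table will see the new files immediately; a partitioned
#    table would still need its partitions registered (e.g. MSCK REPAIR TABLE
#    in Hive).
```

The point of splitting it this way is that the parallel write and the slow S3 copy are decoupled: the copy at least runs as a distributed job across the cluster instead of a single-threaded rename on the driver.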