I have an external Hive table defined with a LOCATION in S3:
```sql
LOCATION 's3n://bucket/path/'
```
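For reference, the table is declared roughly like this (a minimal sketch from PySpark; the table name, columns, and storage format are placeholders, not my real schema):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical table definition -- name, columns, and format are placeholders.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
        id BIGINT,
        value STRING
    )
    STORED AS PARQUET
    LOCATION 's3n://bucket/path/'
""")
```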
When writing to this table at the end of a PySpark job that aggregates a large amount of data, the write to Hive is extremely slow because only one executor/container is used for the write. When writing to an HDFS-backed table, the write happens in parallel and is significantly faster.
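The write itself is just a standard insert at the end of the job, roughly like this (a sketch; `df` holds the aggregated result and the table name is the placeholder from above):

```python
# This is the step that runs on a single executor when the table's
# LOCATION is on S3, but parallelizes fine against an HDFS-backed table.
df.write.insertInto("my_table", overwrite=True)
```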
I've tried defining the table using an s3a:// path instead, but my job fails with vague errors.
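What I tried was simply switching the URI scheme in the DDL (again a sketch with the same placeholder table; my assumption is that on Hadoop 2.7 the s3a filesystem may also need `fs.s3a.*` settings such as credentials, which could be where the errors come from):

```python
# Same placeholder table, with only the URI scheme changed to s3a.
# Assumption: s3a on Hadoop 2.7 may additionally require fs.s3a.* settings
# (e.g. spark.hadoop.fs.s3a.access.key) depending on the EMR release.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS my_table_s3a (
        id BIGINT,
        value STRING
    )
    STORED AS PARQUET
    LOCATION 's3a://bucket/path/'
""")
```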
This is on Amazon EMR 5.0 (Hadoop 2.7) with PySpark 2.0, but I have experienced the same issue with previous versions of EMR/Spark.
Is there a configuration or alternative library that I can use to make this write more efficient?