
I run Spark SQL queries to perform data transformations and then store the final result set (after a series of transformation steps) in S3.

I recently noticed that one of my jobs was creating a large number of partition files while writing to S3 and taking a long time to complete (in fact it was failing). So I want to know if there is any way to do a COALESCE-like operation in the SQL API to reduce the number of partitions before writing to S3.

I know that the SQL API equivalent of repartition is DISTRIBUTE BY (CLUSTER BY does the same and additionally sorts within each partition). So I was wondering whether there is anything similar for the COALESCE operation in the SQL API too.
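For context, this is what repartitioning looks like in plain Spark SQL (table and column names here are hypothetical):

```sql
-- DISTRIBUTE BY repartitions rows by the given expression,
-- analogous to DataFrame.repartition(col)
SELECT col1, col2 FROM TABLE1 DISTRIBUTE BY col1;

-- CLUSTER BY is DISTRIBUTE BY plus SORT BY within each partition
SELECT col1, col2 FROM TABLE1 CLUSTER BY col1;
```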

Please note that I only have access to the SQL API, so my question strictly pertains to the Spark SQL API (e.g. SELECT col from TABLE1 WHERE ...).

We use Spark SQL 2.4.6.7

Thanks


1 Answer


The docs suggest using partitioning hints to coalesce partitions directly in the query (the COALESCE and REPARTITION hints were added in Spark 2.4 via SPARK-24940, so they should be available in your version), e.g.

SELECT /*+ COALESCE(3) */ col FROM TABLE1
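For completeness, here is a sketch of both hints applied when writing out a final result (table names are hypothetical). Like the DataFrame methods of the same names, COALESCE only merges partitions down without a shuffle, while REPARTITION triggers a full shuffle and can also increase the partition count:

```sql
-- Merge the output down to 3 partitions without a shuffle
INSERT OVERWRITE TABLE s3_output_table
SELECT /*+ COALESCE(3) */ col FROM TABLE1;

-- Full shuffle into exactly 10 partitions (can go up or down)
INSERT OVERWRITE TABLE s3_output_table
SELECT /*+ REPARTITION(10) */ col FROM TABLE1;
```

Since each final output partition becomes (at least) one file in S3, lowering the partition count this way directly reduces the number of files written.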