My ETL script reads three tables from a relational database, performs some transformations with PySpark, and uploads the result to my S3 bucket (via s3a).
Here's the code that performs the upload:
dataframe.write.mode("overwrite").partitionBy("dt").parquet(entity_path)
I have about 2 million rows, which are written to S3 as Parquet files partitioned by date ('dt').
The script takes more than two hours to complete this upload to S3, which is extremely slow. It runs on Databricks on a cluster with:
3-8 Workers: 366.0-976.0 GB Memory, 48-128 Cores, 12-32 DBU
I've concluded that the problem is in the upload, but I can't figure out what's going on.
Update:
Using repartition('dt') before the write reduced the execution time to ~20 minutes. That helps, but I think it should run in even less time.
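
For reference, this is roughly what the updated write looks like (a minimal sketch, assuming the same dataframe and entity_path as above; the idea is that repartition('dt') shuffles all rows sharing a 'dt' value into one Spark partition, so the partitioned write produces one Parquet file per date instead of many small files per date):

(
    dataframe
    # Group all rows with the same 'dt' into a single Spark partition
    # so partitionBy("dt") below writes one Parquet file per date
    # rather than one file per upstream task per date.
    .repartition("dt")
    .write
    .mode("overwrite")
    .partitionBy("dt")
    .parquet(entity_path)
)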