My ETL script reads three tables from a relational database, performs some transformations with PySpark, and uploads the result to my S3 bucket (via s3a).
Here's the code that performs the upload:
dataframe.write.mode("overwrite").partitionBy("dt").parquet(entity_path)
I have about 2 million rows, which are written to S3 as Parquet files partitioned by date ('dt').
The script takes more than two hours to complete this upload to S3, which is extremely slow. It runs on Databricks on a cluster with:
3-8 Workers: 366.0-976.0 GB Memory, 48-128 Cores, 12-32 DBU
I've concluded that the problem is in the upload, but I can't figure out what's going on.
Update:
Using repartition('dt') before the write reduced the execution time to ~20 minutes. That helps, but I think it should run in even less time.
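
For reference, this is roughly what the updated write looks like (a minimal sketch, assuming the same dataframe and entity_path as above; the idea is that repartition('dt') shuffles all rows sharing a 'dt' value into one Spark partition, so the partitioned write produces one Parquet file per date instead of many small files per date):

(
    dataframe
    # Group all rows with the same 'dt' into a single Spark partition
    # so partitionBy("dt") below writes one Parquet file per date
    # rather than one file per upstream task per date.
    .repartition("dt")
    .write
    .mode("overwrite")
    .partitionBy("dt")
    .parquet(entity_path)
)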