3
votes

Problem Statement

I've read a partitioned CSV file into a Spark DataFrame.

In order to take advantage of Delta tables, I'm trying to simply export it as Delta into a directory inside Azure Data Lake Storage Gen2. I'm using the code below in a Databricks notebook:

%scala

df_nyc_taxi.write.partitionBy("year", "month").format("delta").save("/mnt/delta/")

The whole DataFrame is around 160 GB.

Hardware Specs

I'm running this code on a cluster with 12 cores and 42 GB of RAM.

However, it looks like the whole writing process is being handled by Spark/Databricks sequentially, i.e. in a non-parallel fashion:

[screenshot: Spark UI showing the write tasks executing sequentially]

The DAG Visualization looks like the following:

[screenshot: DAG visualization of the write job]

All in all, it looks like this will take 1-2 hours to execute.

Questions

  • Is there a way to actually make Spark write to different partitions in parallel?
  • Could the problem be that I'm trying to write the Delta table directly to Azure Data Lake Storage?
2
try repartition(your_partition_columns).write.partitionBy("year", "month") - eliasah
Thanks for the input @eliasah. Doesn't repartition expect an integer rather than a list of columns? - born to hula
When I try: df_nyc_taxi.repartition("year", "month").write.partitionBy("year", "month").format("delta").save("/mnt/delta/") I get: error: overloaded method value repartition with alternatives: (partitionExprs: org.apache.spark.sql.Column*)org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and> (numPartitions: Int,partitionExprs: org.apache.spark.sql.Column*)org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] cannot be applied to (String, String) df_nyc_taxi.repartition("year", "month").write.partitionBy("year", "month").format("delta").save("/mnt/delta/") - born to hula
repartition can take a list of columns as repeated arguments. check this : stackoverflow.com/questions/52521067/… - eliasah
Nice, will check. Thanks - born to hula
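
For reference, here is a minimal sketch of the two repartition overloads that the error message above lists (the DataFrame and column names are taken from the question; the explicit partition count is just an example value):

import org.apache.spark.sql.functions.col

// Overload 1: repartition by expressions only; the number of resulting partitions
// defaults to spark.sql.shuffle.partitions.
val byColumns = df_nyc_taxi.repartition(col("year"), col("month"))

// Overload 2: explicit partition count plus expressions.
val byCountAndColumns = df_nyc_taxi.repartition(400, col("year"), col("month"))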

2 Answers

5
votes

To follow up on @eliasah's comment, perhaps you can try this:

import org.apache.spark.sql.functions._
// Add a random bucket column so each year/month partition is split across many tasks.
df_nyc_taxi.repartition(col("year"), col("month"), (rand() * 200).cast("int")).write.partitionBy("year", "month").format("delta").save("/mnt/delta/")

@eliasah's suggestion will most likely create only one file for each "/mnt/delta/year=XX/month=XX" directory, and only one worker will write the data to each file. The extra random column slices the data further (in this case I'm dividing the data within each year/month partition into up to 200 smaller buckets; you can adjust the number), so that more workers can write concurrently.
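
If you want to tune the 200 above, here is a rough sizing sketch, assuming the ~160 GB from the question and a target output file size of about 128 MB (both numbers are assumptions to adjust for your data):

// Rough sizing sketch: how many ~128 MB slices would cover the whole dataset.
val totalSizeMb  = 160L * 1024                  // ~160 GB of input, as stated in the question
val targetFileMb = 128L                         // assumed target size per output file
val buckets      = totalSizeMb / targetFileMb   // = 1280 slices in total

// A value in this ballpark could replace the hard-coded 200 above,
// e.g. (rand() * buckets).cast("int").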

P.S: sry I don't have enough rep to comment yet :'D

1
votes

This is similar to the other answer; however, I have added a persist after the repartition and before the write. The persisted data goes into memory, and whatever doesn't fit spills to disk, which is still faster than reading it again. It has worked well for me in the past. I chose 1250 partitions because 128 MB is my usual target partition size (roughly 160 GB / 128 MB). Spark became what it is because of in-memory computation, so it is good practice to use it whenever you have the chance.

from pyspark import StorageLevel
from pyspark.sql import functions as F

# Repartition by year/month, cache to memory (spilling to disk), then write as Delta.
df_nyc_taxi.repartition(1250, F.col("year"), F.col("month"))\
    .persist(StorageLevel.MEMORY_AND_DISK)\
    .write.partitionBy("year", "month")\
    .format("delta").save("/mnt/delta/")