We have a requirement to ingest data from a non-partitioned EXTERNAL Hive table, work_db.customer_tbl, into a partitioned EXTERNAL Hive table, final_db.customer_tbl, through PySpark; this was previously done with a Hive query. The final table is partitioned by the column load_date (format yyyy-MM-dd).
We have a simple PySpark script that runs an insert query (the same query used earlier in Hive) via spark.sql(). However, we are seeing serious performance issues: after ingestion the target table has around 3,000 partitions, each holding roughly 4 MB of data except the last one, which is around 4 GB, for a total table size of nearly 15 GB. In addition, each partition ends up with 217 files. The final table is a Snappy-compressed Parquet table.
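For context, this is roughly how we check the partition and file counts above (a minimal sketch; the warehouse path shown is an assumed example, the actual location comes from DESCRIBE FORMATTED):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Partitions registered in the metastore for the final table
partitions = spark.sql("SHOW PARTITIONS final_db.customer_tbl").collect()
print("partition count:", len(partitions))

# Files inside one partition directory, via the Hadoop FileSystem API
jvm = spark.sparkContext._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(spark.sparkContext._jsc.hadoopConfiguration())
part_dir = jvm.org.apache.hadoop.fs.Path(
    "/apps/hive/warehouse/final_db.db/customer_tbl/load_date=2022-01-01")  # assumed path
print("files in partition:", len(fs.listStatus(part_dir)))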
The source work table consists of a single 15 GB file, with a filename in the format customers_tbl_unload.dat.
Earlier, when we ran the Hive query through a Beeline connection, it usually took around 25-30 minutes to finish. The PySpark script now takes around 3 hours.
How can we tune Spark so the ingestion takes less time than it did through Beeline?
The configuration of the YARN queue we use is:
Used Resources: <memory:5117184, vCores:627>
Demand Resources: <memory:5120000, vCores:1000>
AM Used Resources: <memory:163072, vCores:45>
AM Max Resources: <memory:2560000, vCores:500>
Num Active Applications: 45
Num Pending Applications: 45
Min Resources: <memory:0, vCores:0>
Max Resources: <memory:5120000, vCores:1000>
Reserved Resources: <memory:0, vCores:0>
Max Running Applications: 200
Steady Fair Share: <memory:5120000, vCores:474>
Instantaneous Fair Share: <memory:5120000, vCores:1000>
Preemptable: true
The parameters passed to the PySpark job are listed below (the full spark-submit call is sketched after the list):
num-executors=50
executor-cores=5
executor-memory=10GB
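These are passed through spark-submit, roughly as follows (a sketch; the queue name, deploy mode, and script name are assumptions):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --queue our_yarn_queue \
  --num-executors 50 \
  --executor-cores 5 \
  --executor-memory 10G \
  customer_tbl_ingestion.py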
PySpark code used:
insert_stmt = """INSERT INTO final_db.customer_tbl PARTITION(load_date)
SELECT col_1,col_2,...,load_date FROM work_db.customer_tbl"""
spark.sql(insert_stmt)
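For completeness, the spark session used in the snippet above is created roughly like this (a minimal sketch; the app name is an assumption, and the dynamic-partition settings may only matter when Spark falls back to the Hive writer path for the insert):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("customer_tbl_ingestion")          # assumed app name
         # Needed so spark.sql() can resolve tables in the Hive metastore
         .enableHiveSupport()
         # May be required for dynamic-partition inserts through the Hive writer path
         .config("hive.exec.dynamic.partition", "true")
         .config("hive.exec.dynamic.partition.mode", "nonstrict")
         .getOrCreate())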
Even though we are using nearly 10% of the YARN queue's resources, the job still takes this long. How can we tune the job to make it more efficient?