We have a requirement to ingest data from a non-partitioned EXTERNAL Hive table, work_db.customer_tbl, into a partitioned EXTERNAL Hive table, final_db.customer_tbl, through PySpark; this was previously done with a Hive query. The final table is partitioned by the column load_date (format yyyy-MM-dd).
We have a simple PySpark script that runs an insert query (the same query used earlier in Hive) via spark.sql(). However, we are seeing serious performance issues: after ingestion the target table has around 3,000 partitions, each holding roughly 4 MB of data except the last one, which is around 4 GB, for a total table size of nearly 15 GB. In addition, each partition ends up with 217 files. The final table is a Snappy-compressed Parquet table.
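For context, this is roughly how we check the partition and file counts above (a minimal sketch; the warehouse path shown is an assumed example, the actual location comes from DESCRIBE FORMATTED):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Partitions registered in the metastore for the final table
partitions = spark.sql("SHOW PARTITIONS final_db.customer_tbl").collect()
print("partition count:", len(partitions))

# Files inside one partition directory, via the Hadoop FileSystem API
jvm = spark.sparkContext._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(spark.sparkContext._jsc.hadoopConfiguration())
part_dir = jvm.org.apache.hadoop.fs.Path(
    "/apps/hive/warehouse/final_db.db/customer_tbl/load_date=2022-01-01")  # assumed path
print("files in partition:", len(fs.listStatus(part_dir)))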
The source work table consists of a single 15 GB file, with a filename in the format customers_tbl_unload.dat.
Earlier, when we ran the Hive query through a Beeline connection, it usually took around 25-30 minutes to finish. The PySpark script now takes around 3 hours.
How can we tune Spark so the ingestion takes less time than it did through Beeline?
The configuration of the YARN queue we use is:
Used Resources: <memory:5117184, vCores:627>
Demand Resources: <memory:5120000, vCores:1000>
AM Used Resources: <memory:163072, vCores:45>
AM Max Resources: <memory:2560000, vCores:500>
Num Active Applications: 45
Num Pending Applications: 45
Min Resources: <memory:0, vCores:0>
Max Resources: <memory:5120000, vCores:1000>
Reserved Resources: <memory:0, vCores:0>
Max Running Applications: 200
Steady Fair Share: <memory:5120000, vCores:474>
Instantaneous Fair Share: <memory:5120000, vCores:1000>
Preemptable: true
The parameters passed to the PySpark job are listed below (the full spark-submit call is sketched after the list):
num-executors=50
executor-cores=5
executor-memory=10GB
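These are passed through spark-submit, roughly as follows (a sketch; the queue name, deploy mode, and script name are assumptions):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --queue our_yarn_queue \
  --num-executors 50 \
  --executor-cores 5 \
  --executor-memory 10G \
  customer_tbl_ingestion.py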
PySpark code used:
insert_stmt = """INSERT INTO final_db.customer_tbl PARTITION(load_date)
SELECT col_1,col_2,...,load_date FROM work_db.customer_tbl"""
spark.sql(insert_stmt)
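For completeness, the spark session used in the snippet above is created roughly like this (a minimal sketch; the app name is an assumption, and the dynamic-partition settings may only matter when Spark falls back to the Hive writer path for the insert):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("customer_tbl_ingestion")          # assumed app name
         # Needed so spark.sql() can resolve tables in the Hive metastore
         .enableHiveSupport()
         # May be required for dynamic-partition inserts through the Hive writer path
         .config("hive.exec.dynamic.partition", "true")
         .config("hive.exec.dynamic.partition.mode", "nonstrict")
         .getOrCreate())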
Even though we are using nearly 10% of the YARN queue's resources, the job still takes this long. How can we tune the job to make it more efficient?