0
votes

I'm trying to execute a Spark job on an EMR cluster with 6 nodes (8 cores and 56 GB memory on each node). The Spark job performs an incremental load into partitions of a Hive table, and at the end it runs a table refresh to update the metadata.

The refresh command alone takes 3-6 hours to complete, which is too long.

Nature of data in Hive:

  1. 27 GB of data located on S3.
  2. Stored as Parquet.
  3. Partitioned on 2 columns (e.g. s3a://bucket-name/table/partCol1=1/partCol2=2020-10-12).

Note: it is a date-wise partitioning scheme and cannot be changed.

Spark config used:

  • num-executors = 15
  • executor-memory = 16 GB
  • executor-cores = 2
  • driver-memory = 49 GB
  • spark.sql.shuffle.partitions = 48
  • hive.exec.dynamic.partition.mode = nonstrict
  • spark.sql.sources.partitionOverwriteMode = dynamic
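
For reference, the settings above can be expressed as a spark-submit invocation. This is a sketch only: the job script name is a placeholder, and passing the Hive property through the `spark.hadoop.` prefix is my assumption (it can also be set with `spark.sql("SET ...")` inside the job):

```shell
# Sketch: same configuration as listed above, as a spark-submit call.
# your_incremental_load_job.py is a placeholder for the actual job.
spark-submit \
  --num-executors 15 \
  --executor-memory 16g \
  --executor-cores 2 \
  --driver-memory 49g \
  --conf spark.sql.shuffle.partitions=48 \
  --conf spark.sql.sources.partitionOverwriteMode=dynamic \
  --conf spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict \
  your_incremental_load_job.py
```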

Things tried:

  • Tuning the Spark cores/memory/executors, with no luck.
  • The REFRESH TABLE command.
  • The ALTER TABLE ... ADD PARTITION command.
  • MSCK REPAIR TABLE tablename from the Hive CLI, which takes 3-4 hours to complete.

None of the above reduced the time to refresh the partitions in Hive.
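Since the incremental load knows exactly which partitions it wrote, registering only those partitions should be far cheaper than an MSCK REPAIR scan over ~10k partitions. A minimal sketch of building such a statement; the table and column names here (`mytable`, `partCol1`, `partCol2`) are taken from the path example above, and in the job the generated DDL would be passed to `spark.sql(...)`:

```python
def add_partition_ddl(table, partitions):
    """Build an ALTER TABLE ... ADD IF NOT EXISTS statement that
    registers only the newly written partitions, instead of scanning
    every partition with MSCK REPAIR TABLE."""
    specs = " ".join(
        "PARTITION (" + ", ".join(f"{k}='{v}'" for k, v in spec.items()) + ")"
        for spec in partitions
    )
    return f"ALTER TABLE {table} ADD IF NOT EXISTS {specs}"

ddl = add_partition_ddl(
    "mytable",
    [{"partCol1": "1", "partCol2": "2020-10-12"}],
)
print(ddl)
# In the Spark job this would be executed as: spark.sql(ddl)
```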

Some assumptions:

  1. Am I missing any tuning parameter, given that the data is stored on Amazon S3?
  2. The table currently has close to 10k partitions; is this an issue?

Any help will be much appreciated.

1

1 Answer

0
votes
  • If possible, partition on a single column. Multi-level (multi-column) partitioning kills performance.

  • Use R-type instances. They provide more memory than M-type instances at the same price.

  • Use coalesce to merge files in the source if there are many small files.

  • Check the number of mapper tasks. The more tasks there are, the lower the performance.

  • Use EMRFS rather than plain S3 to keep the metadata info.

  • Use the configuration below:

    {
      "Classification": "spark",
      "Properties": {
        "maximizeResourceAllocation": "true"
      }
    }

  • Follow some of the instructions from the link below.
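
On the small-files point above, a rough way to pick a coalesce target is to divide the total data size by a desired output file size. A sketch only; the ~128 MB target file size is my assumption, not from the answer:

```python
import math

def coalesce_target(total_bytes, target_file_bytes=128 * 1024 * 1024):
    """Rough number of output files so that each file ends up around
    target_file_bytes. In the Spark job this would be used as:
    df.coalesce(coalesce_target(total_bytes)).write..."""
    return max(1, math.ceil(total_bytes / target_file_bytes))

# 27 GB of data (the table size from the question) -> 216 files of ~128 MB
print(coalesce_target(27 * 1024**3))
```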