I'm trying to run a Spark job on an EMR cluster with 6 nodes (8 cores and 56 GB memory on each node). The job does an incremental load into partitions of a Hive table, and at the end it runs a `REFRESH TABLE` to update the table metadata.
The refresh command alone takes 3-6 hours to complete, which is far too long.
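For context, the tail end of the job looks roughly like this (a minimal sketch; `db.my_table` and the staging source are placeholder names, not the actual ones):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .enableHiveSupport()
  .getOrCreate()

// Placeholder: the incremental batch to be loaded.
val incrementDf = spark.table("db.my_table_staging")

// With partitionOverwriteMode=dynamic, only the partitions present
// in the increment are overwritten.
incrementDf.write
  .mode("overwrite")
  .insertInto("db.my_table")

// This is the step that takes 3-6 hours: it invalidates and reloads
// the table's cached metadata and file listing.
spark.sql("REFRESH TABLE db.my_table")
```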
Nature of the data in Hive:
- 27 GB of data, stored on S3.
- Parquet format.
- Partitioned on 2 columns (e.g. s3a://bucket-name/table/partCol1=1/partCol2=2020-10-12).

Note: it is a date-wise partitioning scheme and cannot be changed.
Spark config used:
- num-executors = 15
- executor-memory = 16g
- executor-cores = 2
- driver-memory = 49g
- spark.sql.shuffle.partitions = 48
- hive.exec.dynamic.partition.mode = nonstrict
- spark.sql.sources.partitionOverwriteMode = dynamic
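For completeness, the same settings expressed in code (values copied from the list above; note that `spark.driver.memory` only takes effect when passed to spark-submit, since the driver JVM is already running by the time this code executes):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.executor.instances", "15")   // num-executors
  .config("spark.executor.memory", "16g")
  .config("spark.executor.cores", "2")
  .config("spark.driver.memory", "49g")       // effective only via spark-submit
  .config("spark.sql.shuffle.partitions", "48")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
  .enableHiveSupport()
  .getOrCreate()
```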
Things tried:
- Tuning the Spark cores/memory/executors, with no luck.
- The `REFRESH TABLE` command.
- The `ALTER TABLE ... ADD PARTITION` command.
- `MSCK REPAIR TABLE tablename` from the Hive CLI, which takes 3-4 hours to complete.

None of the above reduced the time it takes to refresh the partitions in Hive.
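To be concrete, these are representative forms of what was run (table and partition names are placeholders; `spark` is the Hive-enabled SparkSession from above):

```scala
// 1. Refresh Spark's cached metadata/file listing for the table.
spark.sql("REFRESH TABLE db.tablename")

// 2. Register a single new partition explicitly.
spark.sql("""
  ALTER TABLE db.tablename ADD IF NOT EXISTS
  PARTITION (partCol1 = 1, partCol2 = '2020-10-12')
""")

// 3. Scan the table location on S3 and register all missing partitions.
//    The equivalent `MSCK REPAIR TABLE tablename` from the Hive CLI
//    takes 3-4 hours.
spark.sql("MSCK REPAIR TABLE db.tablename")
```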
Some questions:
- Am I missing any tuning parameter, given that the data is stored on Amazon S3?
- The table currently has close to 10k partitions; is that an issue?
Any help will be much appreciated.