0
votes

I'm trying to execute a Spark job on an EMR cluster with 6 nodes (8 cores and 56 GB memory on each node). The Spark job performs an incremental load into partitions of a Hive table, and at the end it runs a table refresh to update the metadata.

The refresh command alone takes 3-6 hours to complete, which is too long.

Nature of data in Hive:

  1. 27 GB of data located on S3.
  2. Stored as Parquet.
  3. Partitioned on 2 columns (e.g. s3a://bucket-name/table/partCol1=1/partCol2=2020-10-12).

Note: it is a date-wise partitioning scheme and cannot be changed.

Spark config used:

  • num-executors = 15
  • executor-memory = 16 GB
  • executor-cores = 2
  • driver-memory = 49 GB
  • spark.sql.shuffle.partitions = 48
  • hive.exec.dynamic.partition.mode = nonstrict
  • spark.sql.sources.partitionOverwriteMode = dynamic
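
For reference, the settings above can be expressed as a spark-submit invocation. This is a sketch only: the job script name is a placeholder, and passing the Hive property through the `spark.hadoop.` prefix is my assumption (it can also be set with `spark.sql("SET ...")` inside the job):

```shell
# Sketch: same configuration as listed above, as a spark-submit call.
# your_incremental_load_job.py is a placeholder for the actual job.
spark-submit \
  --num-executors 15 \
  --executor-memory 16g \
  --executor-cores 2 \
  --driver-memory 49g \
  --conf spark.sql.shuffle.partitions=48 \
  --conf spark.sql.sources.partitionOverwriteMode=dynamic \
  --conf spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict \
  your_incremental_load_job.py
```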

Things tried:

  • Tuning the Spark cores/memory/executors, with no luck.
  • The REFRESH TABLE command.
  • The ALTER TABLE ... ADD PARTITION command.
  • MSCK REPAIR TABLE tablename from the Hive CLI, which takes 3-4 hours to complete.

None of the above reduced the time to refresh the partitions in Hive.
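Since the incremental load knows exactly which partitions it wrote, registering only those partitions should be far cheaper than an MSCK REPAIR scan over ~10k partitions. A minimal sketch of building such a statement; the table and column names here (`mytable`, `partCol1`, `partCol2`) are taken from the path example above, and in the job the generated DDL would be passed to `spark.sql(...)`:

```python
def add_partition_ddl(table, partitions):
    """Build an ALTER TABLE ... ADD IF NOT EXISTS statement that
    registers only the newly written partitions, instead of scanning
    every partition with MSCK REPAIR TABLE."""
    specs = " ".join(
        "PARTITION (" + ", ".join(f"{k}='{v}'" for k, v in spec.items()) + ")"
        for spec in partitions
    )
    return f"ALTER TABLE {table} ADD IF NOT EXISTS {specs}"

ddl = add_partition_ddl(
    "mytable",
    [{"partCol1": "1", "partCol2": "2020-10-12"}],
)
print(ddl)
# In the Spark job this would be executed as: spark.sql(ddl)
```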

Some assumptions:

  1. Am I missing any tuning parameter, given that the data is stored on Amazon S3?
  2. The table currently has close to 10k partitions; is this an issue?

Any help will be much appreciated.

1

1 Answer

0
votes
  • If possible, partition on a single column. Multi-level (multi-column) partitioning kills performance.

  • Use R-type instances. They provide more memory than M-type instances at the same price.

  • Use coalesce to merge files in the source if there are many small files.

  • Check the number of mapper tasks. The more tasks there are, the lower the performance.

  • Use EMRFS rather than plain S3 to keep the metadata info.

  • Use the configuration below:

    {
      "Classification": "spark",
      "Properties": {
        "maximizeResourceAllocation": "true"
      }
    }

  • Follow some of the instructions from the link below.
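
On the small-files point above, a rough way to pick a coalesce target is to divide the total data size by a desired output file size. A sketch only; the ~128 MB target file size is my assumption, not from the answer:

```python
import math

def coalesce_target(total_bytes, target_file_bytes=128 * 1024 * 1024):
    """Rough number of output files so that each file ends up around
    target_file_bytes. In the Spark job this would be used as:
    df.coalesce(coalesce_target(total_bytes)).write..."""
    return max(1, math.ceil(total_bytes / target_file_bytes))

# 27 GB of data (the table size from the question) -> 216 files of ~128 MB
print(coalesce_target(27 * 1024**3))
```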