I am trying to implement a merge using Delta Lake OSS. My history data is around 7 billion records and the incoming delta is around 5 million records.

The merge is based on a composite key (5 columns).
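For reference, a minimal sketch of the merge I am running (paths and key column names below are placeholders):

```python
from delta.tables import DeltaTable

# `spark` is the active SparkSession (notebook/job context).
# Placeholder paths and key columns -- the real table uses a 5-column composite key.
HISTORY_PATH = "s3://my-bucket/history"        # ~7 billion records
UPDATES_PATH = "s3://my-bucket/daily_delta"    # ~5 million records
KEYS = ["k1", "k2", "k3", "k4", "k5"]

history = DeltaTable.forPath(spark, HISTORY_PATH)
updates = spark.read.format("delta").load(UPDATES_PATH)

# Join condition over the composite key.
condition = " AND ".join(f"t.{k} = s.{k}" for k in KEYS)

(history.alias("t")
    .merge(updates.alias("s"), condition)
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```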

I am spinning up a 10-node cluster of r5d.12xlarge instances (~3 TB memory / ~480 cores).

The job took 35 minutes the first time, and subsequent runs are taking longer.

I tried various optimization techniques, but nothing worked, and after 3 runs I started to get heap memory issues. I also see a lot of spill to disk during the shuffles. Rewriting the history table with an ORDER BY on the merge keys did give a performance improvement: the merge completed in 20 minutes and the spill was around 2 TB. However, the data written by the merge itself is not in the same order (I have no control over the write order), so subsequent runs again take longer. The rewrite is sketched below.
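Roughly what the ordered rewrite looked like (path and column names are placeholders):

```python
# Rewrite the history table sorted by the merge keys.
(spark.read.format("delta").load(HISTORY_PATH)
    .orderBy("k1", "k2", "k3", "k4", "k5")
    .write.format("delta")
    .mode("overwrite")
    .save(HISTORY_PATH))
```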

I was not able to use Z-Ordering, as it is not available in Delta Lake OSS and only comes with a subscription. I tried compaction (sketched below), but that did not help either. Please let me know if there is a better way to optimize the merge process.
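The compaction attempt followed the file-compaction pattern from the Delta Lake docs (path and target file count are placeholders):

```python
# Rewrite the table into fewer, larger files without changing its contents.
# dataChange=false marks the rewrite as a no-op for downstream readers.
(spark.read.format("delta").load(HISTORY_PATH)
    .repartition(2000)                     # placeholder target file count
    .write.format("delta")
    .option("dataChange", "false")
    .mode("overwrite")
    .save(HISTORY_PATH))
```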

1 Answer

Here's a recommendation; it seems you are running your Databricks notebook on AWS.

The way to optimize it is to use a Hive metastore, or another catalog service, alongside Spark. How will this help?

While saving the data you can use bucketing to organize your data according to the merge keys; this bucketing metadata needs to be stored in the metastore, which is why Hive is required.

If you use bucketing, the data will already be laid out by the merge keys and the job will not have to shuffle it excessively, which should improve the performance of your merge.
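Something along these lines with plain Spark (bucket count, column names, and table name are just examples; note that bucketBy requires writing through saveAsTable into a metastore-backed table, and this sketch writes Parquet):

```python
# Hypothetical example: persist the history as a bucketed, sorted table in the metastore.
# `history_df` stands for the DataFrame holding your history data.
(history_df.write
    .format("parquet")
    .bucketBy(512, "k1", "k2", "k3", "k4", "k5")   # bucket on the merge keys
    .sortBy("k1", "k2", "k3", "k4", "k5")
    .mode("overwrite")
    .saveAsTable("warehouse.history_bucketed"))
```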

I am not very sure about Databricks, but if you use EMR you get the option to use the Glue Data Catalog as the metastore, or you can run your own metastore on EMR as well.