I am trying to implement a merge using Delta Lake OSS. My history table has around 7 billion records, and the incoming delta is around 5 million records.
The merge is based on a composite key (5 columns).
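Roughly, the merge looks like this (the table paths and the key column names k1..k5 below are placeholders, not the real names):

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder paths and key names; the real composite key has 5 columns.
history = DeltaTable.forPath(spark, "s3://bucket/history")
updates = spark.read.format("delta").load("s3://bucket/daily_delta")

# Join condition over the 5-column composite key.
condition = " AND ".join(f"t.{c} = s.{c}" for c in ["k1", "k2", "k3", "k4", "k5"])

(history.alias("t")
    .merge(updates.alias("s"), condition)
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```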
I am spinning up a 10-node cluster of r5d.12xlarge instances (~3 TB memory / ~480 cores in total).
The first run of the job took 35 minutes, and subsequent runs are taking progressively longer.
I tried the usual optimization techniques, but nothing worked; after 3 runs I started to get heap memory issues, and I see a lot of spill to disk during the shuffles. I then rewrote the history table ordered by the merge keys (a sketch of the rewrite is below), which did improve performance: the merge completed in 20 minutes, with around 2 TB of spill. However, the data written by the merge process itself is not in the same order, since I have no control over the order in which the merge writes data, so subsequent runs get slower again.
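The sorted rewrite was along these lines (paths, key names, and the partition count are placeholders; repartitionByRange plus sortWithinPartitions is one way to express the ORDER BY so that rows with nearby keys land in the same files and per-file min/max stats can prune during the merge):

```python
# Rewrite the history table clustered by the merge keys.
# repartitionByRange gives a global range partitioning; sortWithinPartitions
# then orders rows inside each output file, tightening per-file min/max stats.
(spark.read.format("delta").load("s3://bucket/history")
    .repartitionByRange(4800, "k1", "k2", "k3", "k4", "k5")
    .sortWithinPartitions("k1", "k2", "k3", "k4", "k5")
    .write.format("delta")
    .mode("overwrite")
    .save("s3://bucket/history_sorted"))  # written to a new path, then swapped in
```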
I was not able to use Z-ORDER, as it is not available in Delta Lake OSS and only comes with a subscription. I also tried compaction, but that did not help either.
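For reference, the compaction I ran was along the lines of the standard OSS bin-packing rewrite with dataChange = false (the path and target file count are placeholders):

```python
# Rewrite the table into fewer, larger files without logically changing the
# data; dataChange=false lets downstream streaming readers skip the rewrite.
num_files = 4800  # placeholder target file count

(spark.read.format("delta").load("s3://bucket/history")
    .repartition(num_files)
    .write
    .option("dataChange", "false")
    .format("delta")
    .mode("overwrite")
    .save("s3://bucket/history"))
```

Please let me know if there is a better way to optimize the merge process.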