1 vote

I have two large dataframes with the same structure and roughly the same amount of data, and I'm left-joining them to find rows that are present in one DataFrame but missing from the other. The dataframes are large enough (up to 80 million rows, or ~40 GB) to rule out a broadcast join. The number of missing rows is generally small, up to 10K in the worst cases. The join is on an expression over 2-4 columns (the same on both sides), in the form of concat(coalesce(colA, ""), "|", coalesce(colB, ""), "|", coalesce(colC, "")), since some of the key columns can be NULL.
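As an aside, a concatenated key like this can produce false matches. A toy pure-Python model of the key expression (the function name is illustrative, not Spark code) shows two failure modes: a column value containing the separator, and coalesce collapsing NULL into the empty string:

```python
def concat_key(*cols, sep="|"):
    # Mirrors concat(coalesce(colA, ""), "|", coalesce(colB, ""), ...)
    return sep.join("" if c is None else str(c) for c in cols)

# Two distinct rows collapse to the same key when a value contains the separator:
assert concat_key("a|b", "c") == concat_key("a", "b|c")

# NULL and empty string also become indistinguishable:
assert concat_key(None, "x") == concat_key("", "x")
```

If the key columns can ever contain the separator character, joining on the individual columns (with null-safe equality) is safer than joining on the concatenation.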

We're looking at using partitioning (either before or after the join), as the shuffle caused by the join seems to be the cause of low performance in our production environment. What's the recommended way to partition the dataframes for an efficient left join?

Using multiple clauses for the join (rather than concatenating everything) and partitioning based on those columns in that order may give you a good performance boost. – ktdrv

Also, have you considered using a Bloom filter or something similar? – ktdrv
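The Bloom filter suggestion amounts to this: build a compact probabilistic set over the join keys of one side, then cheaply discard rows of the other side that cannot possibly match before the expensive shuffle. A minimal pure-Python sketch (class and method names are illustrative, not Spark's API; Spark would apply the idea via a UDF or its own runtime filtering):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: false positives possible, false negatives impossible."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k independent bit positions by salting a hash of the item.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}|{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(item))
```

Because false negatives are impossible, pre-filtering with `might_contain` never drops a genuine match; the small false-positive rate only means a few extra rows survive into the join.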

1 Answer

2 votes

First of all, partitioning can improve the performance of a join only if the join is performed multiple times without at least one of the DataFrames changing in between.

A similar answer is already provided here: using partitioning prior to a one-time join will just move the shuffle to a different place.
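To make that point concrete, here is a toy pure-Python model of hash co-partitioning (all names are illustrative, not Spark code). Once both sides are bucketed by the same key into the same number of partitions, the join can proceed bucket by bucket with no cross-partition lookups; but the bucketing pass itself is exactly the shuffle, so doing it right before a single join only relocates the cost. It pays off only when the partitioned result is reused:

```python
from collections import defaultdict

def hash_partition(rows, key, num_parts):
    """Bucket rows by hash of the join key: a simplified stand-in for
    Spark's hash partitioning (i.e. the shuffle)."""
    parts = [defaultdict(list) for _ in range(num_parts)]
    for row in rows:
        parts[hash(row[key]) % num_parts][row[key]].append(row)
    return parts

def partitioned_left_join(left_parts, right_parts):
    """Left-join partition by partition: matching keys are guaranteed to
    land in the same bucket, so each bucket pair is joined independently."""
    out = []
    for lp, rp in zip(left_parts, right_parts):
        for k, lrows in lp.items():
            matches = rp.get(k, [None])  # no match -> keep left row with None
            for lr in lrows:
                for rr in matches:
                    out.append((lr, rr))
    return out

# Hypothetical data: row {"k": 2, ...} has no match on the right.
left = [{"k": 1, "v": "a"}, {"k": 2, "v": "b"}]
right = [{"k": 1, "v": "x"}]
joined = partitioned_left_join(
    hash_partition(left, "k", 4),
    hash_partition(right, "k", 4),
)
```

In this model, `hash_partition` is the expensive step; `partitioned_left_join` is cheap precisely because that cost was already paid, which is why pre-partitioning helps only when it is amortized over repeated joins.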

Using partitioning after the join will not have any positive effect.

The effect of partitioning on join operations is also explained in this book on page 61.