I have two large dataframes with the same structure, and roughly the same amount of data, and I'm left-joining them to determine if any of the rows are missing in one DF but not the other. The dataframes are large enough (up to 80 million rows, or ~40Gb) to make the broadcast join not an option. The number of missing rows is generally small, up to 10K for worst cases. The join is on an expression over 2-4 columns (same on both sides), in the form of concat(coalesce(colA, ""), "|", coalesce(colB, ""), "|", coalesce(colC, ""))
, as some of the key columns can be NULL.
We're looking at using partitioning (either prior or after the join), as the shuffle caused by the join seem to cause low performance in our production environment. What's the recommended way to partition the dataframes for an efficient left join?