How does Spark reduces shuffle while Joining tables, when underlying tables are bucketed

Question

I have read at Multiple forums that Shuffle is reduced while doing Sort Merge Join when your underlying tables are bucketed and sorted. However, My question is following

A sorted Bucket will only guarantee that data in a bucket is about the same set of keys and data is sorted. Assume We have 2 data frames d1 and d2, both are sorted and bucketed.

Does spark guarantee that bucketx of d1 table containing key1 and key2 data is on the same machine as buckety of d2 table containing key1 and key2?

If bucketx and buckety are guaranteed to be on same machine, then there will be no Exchange across nodes while doing sort-merge join. if they can sit on different machines. then there should be data exchange while doing join.

Please help to understand this concept. Thanks in advance.

Samir Vyas Samir Vyas · Accepted Answer · 2020-08-15T16:11:03

Your understanding is correct. SortMergeJoin requires RangePartitioning of data.

If your dataframes df1 and df2 are already partitioned by a RangePartitioner on key k (which is also used in join) then there would be no extra exchange, otherwise there will be.

How does Spark reduces shuffle while Joining tables, when underlying tables are bucketed

1 Answers