I have read on multiple forums that shuffle is reduced in a sort-merge join when the underlying tables are bucketed and sorted. However, my question is the following.
A sorted bucket only guarantees that the data in a bucket belongs to the same set of keys and that the data is sorted. Assume we have two data frames, d1 and d2, both sorted and bucketed.
- Does Spark guarantee that bucketx of table d1, containing the data for key1 and key2, is on the same machine as buckety of table d2, containing the data for key1 and key2?
If bucketx and buckety are guaranteed to be on the same machine, then there will be no Exchange across nodes during the sort-merge join. If they can sit on different machines, then there should be a data exchange during the join.
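To make my mental model concrete, here is a small pure-Python sketch (not Spark code) of how I understand bucketing. It assumes the bucket index is computed as hash(key) mod number-of-buckets and that both tables use the same bucket count. The names `bucket_id`, `bucketize` and `NUM_BUCKETS` are made up for illustration, and CRC32 is only a stand-in for whatever hash Spark really uses.

```python
import zlib

NUM_BUCKETS = 4  # assumption: d1 and d2 are written with the same bucket count


def bucket_id(key, num_buckets=NUM_BUCKETS):
    """Map a key to a bucket index (stand-in hash, not Spark's own)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def bucketize(rows, num_buckets=NUM_BUCKETS):
    """Group (key, value) rows by bucket index and sort each bucket by key."""
    buckets = {b: [] for b in range(num_buckets)}
    for key, value in rows:
        buckets[bucket_id(key, num_buckets)].append((key, value))
    return {b: sorted(rows_in_b, key=lambda r: r[0]) for b, rows_in_b in buckets.items()}


# Two toy "tables" sharing key1 and key2, given in different input orders.
d1 = bucketize([("key2", "b"), ("key1", "a"), ("key3", "c")])
d2 = bucketize([("key1", "y"), ("key4", "z"), ("key2", "x")])

# For each bucket index, list the keys that each table holds there.
for b in range(NUM_BUCKETS):
    keys_d1 = [k for k, _ in d1[b]]
    keys_d2 = [k for k, _ in d2[b]]
    print(b, keys_d1, keys_d2)
```

In this model a given key always lands at the same bucket index in both tables, but nothing in it says which machine ends up reading which bucket, and that is exactly the part I am unsure about.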
Please help me understand this concept. Thanks in advance.