Why does the Spark Planner in Spark 2.3 prefer a sort merge join over a shuffled hash join? In other words, why is the spark.sql.join.preferSortMergeJoin configuration property internal and turned on by default? What is wrong with a shuffled hash join? Is this specific to Spark doing its computations in a distributed fashion, or is it something more inherent in the join algorithm itself?
You can find the property used in the JoinSelection execution planning strategy (here and here), in a guard that looks like this:
case ... if !conf.preferSortMergeJoin && ... =>
  Seq(joins.ShuffledHashJoinExec(...))
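For context, here is a minimal sketch (the session setup, table names, and data are made up for illustration) of how one might turn the property off and inspect which join the planner actually picks:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("join-selection-demo")
  // flip the internal property so JoinSelection no longer prefers sort merge join
  .config("spark.sql.join.preferSortMergeJoin", "false")
  .getOrCreate()

import spark.implicits._

val left  = Seq((1, "a"), (2, "b")).toDF("id", "v")
val right = Seq((1, "x"), (2, "y")).toDF("id", "w")

// Note: preferSortMergeJoin=false alone is not sufficient for a
// ShuffledHashJoinExec; the other JoinSelection conditions (e.g. the build
// side being small enough to hash, and much smaller than the other side)
// must also hold, otherwise the planner still falls back to sort merge join.
left.join(right, "id").explain()
```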