How to hint for sort merge join or shuffled hash join (and skip broadcast hash join)?

Question

I have an issue with a join in Spark 2.1. Spark (wrongly?) chooses a broadcast-hash join although the table is very large (14 million rows). The job then crashes because there is not enough memory and Spark somehow tries to persist the broadcast pieces to disk, which then lead to a timeout.

So, I know there is a query hint to force a broadcast-join (org.apache.spark.sql.functions.broadcast), but is there also a way to force another join algorithm?

I solved my issue by setting spark.sql.autoBroadcastJoinThreshold=0, but I would prefer another solution which is more granular, i.e. not disable the broadcast join globally.

Both the tables that you are trying to join are large? Can you try to change the order? if you are doing table1.leftjoin(table2). can you rerun with table2.leftjoin(table1) ? — Sanchit Grover
Just provide the property when creating spark session / sqlcontext for your application then it will be specific to that application, it wont change any global settings. — SanthoshPrasad
try setting it before that join in your driver code and reset it after that particular join executed, but the DAG matters here, if other joins also part of the same job ( same action) then it may affect them as well. — SanthoshPrasad

Jacek Laskowski Jacek Laskowski · Accepted Answer · 2018-04-28T16:54:21

If a broadcast hash join can be used (by the broadcast hint or by total size of a relation), Spark SQL chooses it over other joins (see JoinSelection execution planning strategy).

With that said, don't force a broadcast hash join (using broadcast standard function on the left or right join side) or disable the preference for a broadcast hash join using spark.sql.autoBroadcastJoinThreshold to be 0 or negative.

How to hint for sort merge join or shuffled hash join (and skip broadcast hash join)?

2 Answers