7
votes

I have an issue with a join in Spark 2.1. Spark (wrongly?) chooses a broadcast-hash join although the table is very large (14 million rows). The job then crashes because there is not enough memory and Spark somehow tries to persist the broadcast pieces to disk, which then lead to a timeout.

So, I know there is a query hint to force a broadcast-join (org.apache.spark.sql.functions.broadcast), but is there also a way to force another join algorithm?

I solved my issue by setting spark.sql.autoBroadcastJoinThreshold=0, but I would prefer another solution which is more granular, i.e. not disable the broadcast join globally.

2
Both the tables that you are trying to join are large? Can you try to change the order? if you are doing table1.leftjoin(table2). can you rerun with table2.leftjoin(table1) ?Sanchit Grover
@SanchitGrover yes, both are large. Order does not matter.Raphael Roth
Just provide the property when creating spark session / sqlcontext for your application then it will be specific to that application, it wont change any global settings.SanthoshPrasad
try setting it before that join in your driver code and reset it after that particular join executed, but the DAG matters here, if other joins also part of the same job ( same action) then it may affect them as well.SanthoshPrasad
Can you include yourQuery.explain to your question?Jacek Laskowski

2 Answers

5
votes

If a broadcast hash join can be used (by the broadcast hint or by total size of a relation), Spark SQL chooses it over other joins (see JoinSelection execution planning strategy).

With that said, don't force a broadcast hash join (using broadcast standard function on the left or right join side) or disable the preference for a broadcast hash join using spark.sql.autoBroadcastJoinThreshold to be 0 or negative.

4
votes

Along with setting spark.sql.autoBroadcastJoinThreshold to 0 or to a negative value as per Jacek's response, check the state of 'spark.sql.join.preferSortMergeJoin'

Hint for Sort Merge join : Set the above conf to true

Hint for Shuffled Hash join: Set the above conf to false.