I have an issue with a join
in Spark 2.1. Spark (wrongly?) chooses a broadcast-hash join
although the table is very large (14 million rows). The job then crashes because there is not enough memory and Spark somehow tries to persist the broadcast pieces to disk, which then leads to a timeout.
So, I know there is a query hint to force a broadcast join (org.apache.spark.sql.functions.broadcast), but is there also a way to force another join algorithm?
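For reference, a minimal sketch of that hint (the DataFrame names and paths are placeholders, not my actual tables):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("join-example").getOrCreate()

// Hypothetical inputs -- in my case the side Spark decides to broadcast has ~14M rows.
val largeDF = spark.read.parquet("/path/to/large")
val smallDF = spark.read.parquet("/path/to/small")

// broadcast() forces a broadcast-hash join by shipping smallDF to every executor.
// What I'm looking for is the opposite: a way to force a *non*-broadcast join.
val joined = largeDF.join(broadcast(smallDF), Seq("id"))
```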
I solved my issue by setting spark.sql.autoBroadcastJoinThreshold=0
, but I would prefer a more granular solution, i.e. one that does not disable broadcast joins globally.
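Concretely, my workaround looks like this (a sketch; I'm assuming the setting is applied at runtime via spark.conf.set, but it could equally be passed as --conf spark.sql.autoBroadcastJoinThreshold=0 at submit time):

```scala
// Setting the threshold to 0 disables automatic broadcast joins for the whole
// session: every table is now considered "too big" to broadcast.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "0")

// The join now falls back to a shuffle-based join (typically sort-merge),
// but the setting also affects every other join in the session.
val joined = largeDF.join(smallDF, Seq("id"))
```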
Could you add the output of yourQuery.explain to your question? – Jacek Laskowski
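A sketch of what that would look like, assuming the joined DataFrame from above:

```scala
// Prints the parsed, analyzed, optimized, and physical plans. With the default
// threshold the physical plan shows BroadcastHashJoin; with the threshold set
// to 0 it typically shows SortMergeJoin instead.
joined.explain(true)
```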