I have an issue with a join in Spark 2.1. Spark (wrongly?) chooses a broadcast-hash join even though the table is very large (14 million rows). The job then crashes because there is not enough memory, and Spark somehow tries to persist the broadcast pieces to disk, which then leads to a timeout.
So, I know there is a query hint to force a broadcast join (`org.apache.spark.sql.functions.broadcast`), but is there also a way to force a different join algorithm?
I solved my issue by setting `spark.sql.autoBroadcastJoinThreshold=0`, but I would prefer a more granular solution, i.e. one that does not disable broadcast joins globally.
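One way to scope the setting more narrowly is to change the threshold only around the problematic query and restore it afterwards. A sketch (the DataFrame names `large` and `other` are hypothetical, and note that the config is read at planning time, so the query must be planned while the setting is active):

```scala
// Remember the current threshold so it can be restored afterwards.
val previous = spark.conf.get("spark.sql.autoBroadcastJoinThreshold")

// -1 disables automatic broadcast joins entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// With broadcasting disabled, the planner falls back to a
// shuffle-based join (typically sort-merge) for this query.
val joined = large.join(other, Seq("id"))
joined.explain()  // verify: the plan should show SortMergeJoin

// Restore the previous threshold for the rest of the session.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", previous)
```

Because DataFrames are lazy, any action on `joined` that triggers re-planning after the threshold is restored could pick a broadcast join again, so it is safest to execute the query (or checkpoint its result) while the setting is in effect.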
Could you add the output of `yourQuery.explain` to your question? - Jacek Laskowski