I am using Spark 2.1.1. I have a very complex query written in Spark SQL that I am trying to optimise. For one section of it I am trying to use a broadcast join. But even though I have set:
spark.sql.autoBroadcastJoinThreshold=1073741824
which is 1 GB, the physical plan Spark generates for this section still uses SortMergeJoin. Do you have any insight into why the broadcast join is not used, even though one side is shown as much smaller (only a few MB) on the Spark UI -> SQL tab?
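For completeness, this is roughly how I set the threshold and inspect the plan in the same session (the `EXPLAIN` is over the same join shown below; table names are from my schema):

```sql
-- Raise the broadcast threshold to 1 GB for this session
SET spark.sql.autoBroadcastJoinThreshold=1073741824;

-- Inspect the physical plan for the affected join
EXPLAIN
SELECT /*+ BROADCAST (a) */ a.id, big.gid
FROM (SELECT DISTINCT id FROM a_par WHERE gid IS NULL) a
JOIN big ON (a.id = big.id);
```

The `EXPLAIN` output is where I see SortMergeJoin instead of BroadcastHashJoin.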
My SQL code section for the affected portion looks like:
-- Preceding SQL
(
SELECT /*+ BROADCAST (a) */ -- Size of a is within broadcast threshold as per UI
a.id,
big.gid
FROM
(SELECT DISTINCT id FROM a_par WHERE gid IS NULL) a
JOIN
big ON (a.id=big.id)
)
-- Succeeding SQL
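For comparison, a rough DataFrame-API equivalent of the section above, using the explicit `broadcast()` function from `org.apache.spark.sql.functions` (a sketch, assuming `a_par` and `big` are registered tables; I have not switched to this yet):

```scala
import org.apache.spark.sql.functions.broadcast

// Distinct ids from a_par where gid is null (the small side)
val a = spark.table("a_par")
  .where("gid IS NULL")
  .select("id")
  .distinct()

// Explicitly mark the small side for broadcast, independent of the threshold
val joined = spark.table("big")
  .join(broadcast(a), "id")
  .select("id", "gid")

joined.explain() // check for BroadcastHashJoin in the physical plan
```

I would expect `broadcast(a)` to force the hash join here regardless of the size estimate, which is part of why the SQL hint being ignored is confusing.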
