
I have a small table (2k records) and a big table (5 million records). I need to fetch all rows from the small table and only the matching rows from the large table, so I ran the query below:

select /*+ broadcast(small) */ small.* from small left outer join large

The query returns the correct result, but when I check the query plan it shows a sort merge join instead of a broadcast hash join. Is there a limitation that the small table can't be broadcast when it is the left table, and if so, what's the way out?


2 Answers

7 votes

Because you want the complete dataset from the small table rather than from the big table, Spark does not enforce a broadcast join. If you change the join sequence or convert the outer join to an inner join, Spark will happily use the broadcast join.

Eg:

  1. Big-Table left outer join Small-Table   -- Broadcast enabled
  2. Small-Table left outer join Big-Table   -- Broadcast disabled

Reason: Spark ships the small table (a.k.a. the broadcast table) to every node where data of the big table is present. In your case you need all rows from the small table but only the matching rows from the big table. With the small table broadcast, each node would join its own partition of the big table against a full copy of the small table, so no single node can tell whether an unmatched small-table row found a match on some other node or had no match at all. Because of this ambiguity, Spark cannot correctly emit all rows of the small table from a broadcast copy, so it does not use a broadcast join in this case.
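To see the ambiguity concretely, here is a hypothetical plain-Python sketch (not Spark code; the names `small`, `big_partitions`, and `executor_join` are made up for illustration) of what would happen if each executor naively left-outer-joined its own big-table partition against a broadcast copy of the small table:

```python
# Hypothetical sketch (plain Python, not Spark): each "executor" holds one
# partition of the big table plus a full broadcast copy of the small table.
small = [(1, 'a'), (2, 'b')]                 # preserved (left) side, broadcast
big_partitions = [[(1, 'x')], [(3, 'y')]]    # big table split over 2 executors

def executor_join(big_part, small_copy):
    # Naive per-executor left outer join: the executor only sees ITS
    # partition of the big table, so it cannot know whether an unmatched
    # small-table row matches on some other executor.
    out = []
    for k, v in small_copy:
        matches = [(k, v, bv) for bk, bv in big_part if bk == k]
        out.extend(matches if matches else [(k, v, None)])
    return out

rows = [r for part in big_partitions for r in executor_join(part, small)]
# rows contains (2, 'b', None) twice (once per executor), and BOTH
# (1, 'a', 'x') and the contradictory (1, 'a', None) -- exactly the
# ambiguity that prevents Spark from broadcasting the preserved side.
```

Broadcasting the non-preserved (big) side would avoid this, but a 5-million-row table is usually too large to broadcast, which is why flipping the join as described below is the practical fix.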

1 vote

Change the order of the tables: you are doing a left join while broadcasting the left table, but in a left join only the right table can be broadcast. So either move the broadcast table to the right side, or change the join type to a right outer join.

select /*+ broadcast(small)*/ small.* From small right outer join large
select /*+ broadcast(small)*/ small.* From large left outer join small

Example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, 'a')], ['id', 'name'])
df1 = spark.createDataFrame([(1, 'a')], ['id', 'name'])

#broadcasting the right-side df1 and performing a left join -> BroadcastHashJoin
df.join(broadcast(df1),['id'],'left').explain()
#== Physical Plan ==
#*(2) Project [id#0L, name#1, name#5]
#+- *(2) BroadcastHashJoin [id#0L], [id#4L], LeftOuter, BuildRight
#   :- Scan ExistingRDD[id#0L,name#1]
#   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
#      +- *(1) Filter isnotnull(id#4L)
#         +- Scan ExistingRDD[id#4L,name#5]


#broadcasting df1 in a right join is not supported, so Spark falls back to SortMergeJoin
df.join(broadcast(df1),['id'],'right').explain()
#== Physical Plan ==
#*(4) Project [id#4L, name#1, name#5]
#+- SortMergeJoin [id#0L], [id#4L], RightOuter
#   :- *(2) Sort [id#0L ASC NULLS FIRST], false, 0
#   :  +- Exchange hashpartitioning(id#0L, 200)
#   :     +- *(1) Filter isnotnull(id#0L)
#   :        +- Scan ExistingRDD[id#0L,name#1]
#   +- *(3) Sort [id#4L ASC NULLS FIRST], false, 0
#      +- Exchange hashpartitioning(id#4L, 200)
#         +- Scan ExistingRDD[id#4L,name#5]