
I am new to Spark SQL. My role involves writing Spark SQL queries for data transformation. I was recently introduced to the Broadcast Hash Join (BHJ) in Spark SQL. I understand that a BHJ performs very well when the broadcast table is very small, and that it can be induced with query hints. For example:

select /*+ BROADCAST(B) */
*
from A
Left Join B
on A.id = B.id;

I have also read that there are two types of Broadcast Joins, Driver BHJ and Executor BHJ (the latter yields better performance).

So when I use a Broadcast hint in my query, does Spark use a Driver BHJ or an Executor BHJ?

How can I instruct Spark (via hints etc.) to use an Executor BHJ instead of a Driver BHJ?

I use Spark SQL 2.4.

Thanks


1 Answer


For Spark 2.4 you don't have to specify any hints.

You can set the Spark config below to control the maximum size of a table that Spark will broadcast automatically in a join. The value is in bytes, and the default is 10 MB (10485760).

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "<value in bytes>")
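
Since the question is about writing Spark SQL queries, here is a rough sketch of the same setting done from SQL instead of through `spark.conf.set`. It assumes a session where `SET` and `EXPLAIN` can be run directly and reuses tables A and B from the question. The value 10485760 is simply 10 * 1024 * 1024 and is only illustrative.

```sql
-- The threshold is given in bytes; 10485760 = 10 * 1024 * 1024 (10 MB).
SET spark.sql.autoBroadcastJoinThreshold = 10485760;

-- Print the physical plan. If B is broadcast, the plan contains a
-- BroadcastHashJoin node; otherwise expect something like SortMergeJoin.
EXPLAIN
select *
from A
Left Join B
on A.id = B.id;
```

Setting the value to -1 turns automatic broadcasting off. Whether B is actually broadcast depends on Spark's size estimate for it, so it is safer to check the plan than to assume.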