I am new to Spark and have a question around Broadcast Joins. We use Spark 2.4.0 and use Spark Temporary Views for data transformations -
create temporary view product as
select /*+ BROADCAST(b) */
a.custid, b.prodid
from cust a
join prod b
on a.prodid = b.prodid
I know there is a parameter for broadcast joins spark.sql.autoBroadcastJoinThreshold which has a value of 10 i.e. 10MB. But, I also read somewhere that the maximum size of a broadcast table could be 8GB. What is the significance of these two values ?
For broadcasting a table/view, does the size always have to be in MB or, is it also possible to broadcast (with a hint) a table/view of 5GB (for example) ? In that case, do I have to manipulate the value of the parameter spark.sql.autoBroadcastJoinThreshold by setting it to a higher value (=5120 i.e. 5GB) ? Or, will it allow me to broadcast the table/view since it is below the max limit (i.e. 8GB) ?
If I want to broadcast a table (in GBs) is it recommended from a query performance perspective ?
Any help is appreciated.
Thanks