Broadcast Hash join with spark dataframe

Question

I am trying to do a broadcast hash join in Spark 1.6.0 but have not succeeded. Below is an example:

import org.apache.spark.sql.functions.broadcast

val DF1 = sqlContext.read.parquet("path1")

val DF2 = sqlContext.read.parquet("path2")


val Join = DF1.as("tc").join(broadcast(DF2.as("st")), Seq("col1"), "left_outer")

Even though I am using the broadcast hint, explain on the DataFrame shows SortMergeOuterJoin. One reason, I think, is that DF2 is larger than 20 MB while the property spark.sql.autoBroadcastJoinThreshold defaults to 10 MB, but I am not able to change this property in spark-shell. Am I doing anything wrong?

I tried as below

spark.sql.autoBroadcastJoinThreshold=100MB

scala> spark.sql.autoBroadcastJoinThreshold=100MB
<console>:1: error: Invalid literal number
       spark.sql.autoBroadcastJoinThreshold=100MB

I need to set this property and see whether I can get a broadcast hash join and whether it improves performance. I checked many threads on Stack Overflow but could not get it to work. Can anyone please help me here?

xmorera xmorera · Accepted Answer · 2017-12-10T22:53:38

Try doing the following:

Edit: here is the Scala code; the Python version is below.

scala> spark.conf.get("spark.sql.autoBroadcastJoinThreshold")
res1: String = 10485760

scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "20971520")

scala> spark.conf.get("spark.sql.autoBroadcastJoinThreshold")
res3: String = 20971520

Python code: If my memory serves me well, whenever you pass the SparkConf object it gets cloned, so you can't change it on the context, but you can on the session.

First I check the current size of the threshold, and indeed it is 10 MB:

>>> spark.conf.get('spark.sql.autoBroadcastJoinThreshold')
u'10485760'

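Now I get a session through the builder with the new value. If a session already exists, getOrCreate() returns that one and applies the option to it; and don't worry, with DataFrames (yeah... Dataset[Row]) you can have multiple sessions anyway: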

spark_new = SparkSession.builder.config("spark.sql.autoBroadcastJoinThreshold","20971520").getOrCreate()

And then I confirm that the new configuration value is set

>>> spark_new.conf.get('spark.sql.autoBroadcastJoinThreshold')
u'20971520'

There you go, double the size
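The values passed above are plain byte counts: 10485760 is 10 MB and 20971520 is 20 MB. As a minimal sketch, a size such as "100MB" from the question could be turned into that form before handing it to the conf API. The helper name `size_to_bytes` and its unit table are assumptions made for illustration; they are not part of Spark.

```python
import re

# Binary multipliers: 1 MB is taken as 1024 * 1024 bytes,
# which matches 10485760 for the 10 MB default shown above.
_UNITS = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def size_to_bytes(text):
    """Convert a size such as '100MB' or '20 mb' to a byte count.

    Hypothetical helper for illustration only; not a Spark function.
    """
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]+)\s*", text)
    if match is None:
        raise ValueError("expected something like '100MB', got %r" % text)
    number, unit = match.groups()
    unit = unit.upper()
    if unit not in _UNITS:
        raise ValueError("unknown unit %r" % unit)
    return int(number) * _UNITS[unit]


print(size_to_bytes("10MB"))   # 10485760
print(size_to_bytes("20MB"))   # 20971520
print(size_to_bytes("100MB"))  # 104857600
```

The result can then be passed as a string, for example spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(size_to_bytes("100MB"))).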

Note: I work in Python, but add a val here and there, plus a couple of other syntactic differences, and you should be fine. Hope it helps or guides you in the right direction.
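One caveat: the question mentions Spark 1.6.0, where the shell exposes sqlContext rather than the spark session used above. The sketch below assumes that a SparkSession offers conf.set (as in the examples above) and that a SQLContext offers setConf(key, value). The helper name `set_broadcast_threshold` is made up for illustration; treat this as a starting point rather than a definitive recipe.

```python
KEY = "spark.sql.autoBroadcastJoinThreshold"


def set_broadcast_threshold(entry_point, n_bytes):
    """Set the broadcast threshold on a SparkSession or a SQLContext.

    Hypothetical helper: it assumes a session exposes `conf.set(key, value)`
    and a SQLContext exposes `setConf(key, value)`.
    Returns the string value that was passed to Spark.
    """
    value = str(int(n_bytes))
    conf = getattr(entry_point, "conf", None)
    if conf is not None:
        # SparkSession-style entry point (Spark 2.x and later)
        conf.set(KEY, value)
    else:
        # SQLContext-style entry point (e.g. a Spark 1.6 shell)
        entry_point.setConf(KEY, value)
    return value


# Usage inside a shell, where `spark` or `sqlContext` already exists:
#   set_broadcast_threshold(spark, 100 * 1024 * 1024)       # 2.x session
#   set_broadcast_threshold(sqlContext, 100 * 1024 * 1024)  # 1.6 context
```

After changing the value, run explain() on the join again to see which join strategy the plan reports.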
