
I have two DataFrames, say sDF (small size) and bDF (big size). I am trying to join them using a broadcast join. I have invoked spark-shell using

--conf spark.sql.autoBroadcastJoinThreshold=10737418240

and validated the same with this query:

scala> (spark.conf.get("spark.sql.autoBroadcastJoinThreshold").toLong)/1024/1024
res11: Long = 10240
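
For reference, the same threshold can also be set from within the session instead of via --conf (a minimal sketch using the same 10 GB value):

// Set the broadcast threshold at runtime (same 10 GB value as the --conf flag)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10737418240L)

// Verify the value in MB, as above; prints 10240
println(spark.conf.get("spark.sql.autoBroadcastJoinThreshold").toLong / 1024 / 1024)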

The smaller DataFrame (sDF) has the following info:

scala> sDF.count
res14: Long = 419
scala> sDF.groupBy(spark_partition_id).count.show(1000, false)
+--------------------+-----+
|SPARK_PARTITION_ID()|count|
+--------------------+-----+
|148                 |3    |
|31                  |3    |
......

The complete sDF details can be seen here.

The bigger DataFrame (bDF) has the following info:

scala> bDF.groupBy(spark_partition_id).count.show(10000, false)
+--------------------+--------+
|SPARK_PARTITION_ID()|count   |
+--------------------+--------+
|148                 |52996917|
|31                  |52985656|
|137                 |52991784|
|85                  |52990666|
....

The complete bDF details can be seen here.

Now, in both of the following cases:

- bDF.join(sDF, ..., "inner")
- bDF.join(broadcast(sDF), ..., "inner")

I always get a SortMergeJoin in the explain output. How can I change it to a broadcast join?
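
For reference, here is a minimal sketch of the second variant, assuming a hypothetical join key column named id:

import org.apache.spark.sql.functions.broadcast

// Hint Spark to broadcast the small DataFrame; "id" is a hypothetical join key
val joined = bDF.join(broadcast(sDF), Seq("id"), "inner")

// The physical plan still shows SortMergeJoin rather than BroadcastHashJoin
joined.explain()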

Spark version: 2.2.1


1 Answer


In order to force a broadcast join, disable the planner's preference for SortMergeJoin using

 spark.conf.set("spark.sql.join.preferSortMergeJoin", false)
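
A minimal end-to-end sketch (the join key column id is hypothetical); if this has the effect described above, explain should then report a BroadcastHashJoin instead of a SortMergeJoin:

import org.apache.spark.sql.functions.broadcast

// Disable the preference for SortMergeJoin so the planner can pick a broadcast join
spark.conf.set("spark.sql.join.preferSortMergeJoin", false)

// Join with an explicit broadcast hint on the small DataFrame ("id" is a hypothetical key)
val joined = bDF.join(broadcast(sDF), Seq("id"), "inner")

// Inspect the physical plan
joined.explain()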