I have two DataFrames, say sDF (small) and bDF (big), and I am trying to join them with a broadcast join. I invoked spark-shell with
--conf spark.sql.autoBroadcastJoinThreshold=10737418240
and validated it with the following query:
scala> (spark.conf.get("spark.sql.autoBroadcastJoinThreshold").toLong)/1024/1024
res11: Long = 10240
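As an additional sanity check (an alternative way to inspect the setting, not something from my original session), the value can also be read back through Spark SQL's SET command:

```scala
// The SET command returns a DataFrame with key/value columns,
// echoing the current value of the configuration entry.
spark.sql("SET spark.sql.autoBroadcastJoinThreshold").show(false)
```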
The smaller DataFrame (sDF) has the following info:
scala> sDF.count
res14: Long = 419
scala> sDF.groupBy(spark_partition_id).count.show(1000, false)
+--------------------+-----+
|SPARK_PARTITION_ID()|count|
+--------------------+-----+
|148                 |3    |
|31                  |3    |
......
The complete sDF details can be seen here.
The bigger DataFrame (bDF) has the following info:
scala> bDF.groupBy(spark_partition_id).count.show(10000, false)
+--------------------+--------+
|SPARK_PARTITION_ID()|count   |
+--------------------+--------+
|148                 |52996917|
|31                  |52985656|
|137                 |52991784|
|85                  |52990666|
....
The complete bDF details can be seen here.
Now, in both of the following cases:
- bDF.join(sDF, ..., "inner")
- bDF.join(broadcast(sDF), ..., "inner")
explain always shows a SortMergeJoin. How can I change it to a broadcast join?
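For concreteness, here is a minimal sketch of what I am running; the join key "id" is a placeholder for my actual join condition:

```scala
import org.apache.spark.sql.functions.broadcast

// "id" is a placeholder join key; the real condition differs.
val plain = bDF.join(sDF, Seq("id"), "inner")
plain.explain()   // the plan contains SortMergeJoin

// broadcast() attaches a hint asking the planner to broadcast sDF.
val hinted = bDF.join(broadcast(sDF), Seq("id"), "inner")
hinted.explain()  // still shows SortMergeJoin for me
```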
Spark version: 2.2.1