I have two RDDs. One has between 5 and 10 million entries, and the other has between 500 and 750 million entries. At some point, I have to join these two RDDs on a common key.
val rddA = someData.rdd.map { x => (x.key, x) } // ~10 million entries
val rddB = someData.rdd.map { y => (y.key, y) } // ~600 million entries
val joinRDD = rddA.join(rddB)
When Spark performs this join, it chooses a ShuffledHashJoin. This causes many of the items in rddB to be shuffled across the network, and some of rddA is shuffled as well. In this case, rddA is too "big" to use as a broadcast variable, but it seems like a BroadcastHashJoin would be more efficient. Is there a way to hint to Spark to use a BroadcastHashJoin? (Apache Flink supports this through join hints.)
If not, is the only option to increase the autoBroadcastJoinThreshold?
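One workaround I'm aware of at the RDD level is to perform the broadcast join by hand: collect the smaller RDD as a map, broadcast it, and probe it from each partition of the larger RDD. Below is a minimal sketch under the assumption that the small side fits in driver and executor memory once collected; Record is a hypothetical stand-in for the real row type.

import org.apache.spark.sql.SparkSession

case class Record(key: Long, value: String)

val spark = SparkSession.builder().appName("manual-broadcast-join").getOrCreate()
val sc = spark.sparkContext

val rddA = sc.parallelize(Seq(Record(1L, "a"))).map(x => (x.key, x)) // small side
val rddB = sc.parallelize(Seq(Record(1L, "b"))).map(y => (y.key, y)) // large side

// Collect the small side on the driver and ship it to every executor once.
val smallMap = sc.broadcast(rddA.collectAsMap())

// Map-side join: rddB is never shuffled; each partition probes the broadcast map.
val joined = rddB.mapPartitions { iter =>
  val lookup = smallMap.value
  iter.flatMap { case (k, b) => lookup.get(k).map(a => (k, (a, b))) }
}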
Update 7/14
My performance issue appears to be rooted squarely in repartitioning. Normally, an RDD read from HDFS is partitioned by block, but in this case the source was a Parquet datasource [that I made]. When Spark (Databricks) writes a Parquet file, it writes one file per partition, and likewise it reads one partition per file. So the best answer I've found is to partition the datasource by key while producing it, write it out to the Parquet sink (which is then naturally co-partitioned), and use that as rddB.
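For concreteness, here is a rough sketch of what that looks like; the paths, the key column name ("key"), and the partition count of 200 are all placeholder assumptions. The first part runs when producing the datasource, the second at join time.

import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("co-partitioned-source").getOrCreate()
import spark.implicits._

// At datasource production time: cluster rows by the join key before writing,
// so each Parquet file (and therefore each partition read back later) holds a
// single hash bucket of keys.
spark.read.parquet("/data/large_raw")          // hypothetical input path
  .repartition(200, $"key")
  .write.parquet("/data/large_by_key")         // hypothetical output path

// At join time: hash-partition the large RDD explicitly and persist it, so that
// shuffle happens only once; the join can then reuse rddB's partitioner, and
// only the smaller side (rddA) needs to be shuffled.
val partitioner = new HashPartitioner(200)
val rddB = spark.read.parquet("/data/large_by_key").rdd
  .map(row => (row.getAs[Long]("key"), row))
  .partitionBy(partitioner)
  .persist()
val rddA = spark.read.parquet("/data/small").rdd   // hypothetical path
  .map(row => (row.getAs[Long]("key"), row))
val joinRDD = rddA.join(rddB)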
The answer given is correct, but I think the details about the Parquet datasource may be useful to someone else.