
When I join two dataframes as:

 val secondDf = sparkSession.read.parquet(inputPath)
 val joinedDf = firstDf.join(secondDf, Seq("ID"), "left_outer")

Spark seems to do a broadcast join, and no shuffling happens.

But as soon as I cache the smaller DataFrame:

 val secondDf = sparkSession.read.parquet(inputPath).cache()
 val joinedDf = firstDf.join(secondDf, Seq("ID"), "left_outer")

Spark shuffles for the join, so no broadcast join seems to happen.
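
For what it's worth, the difference shows up in the physical plan of the join (using the joinedDf from above):

 // A broadcast join appears as a BroadcastHashJoin node in the plan,
 // a shuffled join as a SortMergeJoin with Exchange (shuffle) nodes.
 joinedDf.explain()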

My question is: why is this happening, and how can I avoid the shuffle when I cache one of the DataFrames?

Thanks a lot


1 Answer


Try

 import org.apache.spark.sql.functions.broadcast
 firstDf.join(broadcast(secondDf), Seq("ID"), "left_outer")

I'm not sure why caching should make a difference here; Spark's join strategy selection can be a bit unpredictable sometimes.
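
My best guess is that the decision depends on spark.sql.autoBroadcastJoinThreshold together with Spark's size estimate of secondDf, and that caching changes that estimate. A rough sketch for checking and raising the threshold (the 100 MB value below is only an example, tune it to your data):

 // Current threshold in bytes (the default is 10485760, i.e. 10 MB);
 // setting it to -1 disables automatic broadcast joins entirely.
 println(sparkSession.conf.get("spark.sql.autoBroadcastJoinThreshold"))

 // Raise the threshold so the (cached) secondDf still qualifies for broadcasting.
 sparkSession.conf.set("spark.sql.autoBroadcastJoinThreshold", 100L * 1024 * 1024)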

You could also try writing secondDf to disk and reading it back in instead of caching it; if it is small, the overhead of doing this will be minimal.
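
Roughly like this (tmpPath is just a placeholder for whatever scratch location you use):

 // Materialize secondDf as parquet and read it back instead of caching,
 // so the join is planned against a plain parquet scan again.
 secondDf.write.mode("overwrite").parquet(tmpPath)
 val materializedDf = sparkSession.read.parquet(tmpPath)
 val joinedDf = firstDf.join(materializedDf, Seq("ID"), "left_outer")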