When I join two dataframes as:
val secondDf = sparkSession.read.parquet(inputPath)
val joinedDf = firstDf.join(secondDf, Seq("ID"), "left_outer")
Spark seems to do a broadcast join, and no shuffling happens.
But as soon as I cache the smaller DataFrame:
val secondDf = sparkSession.read.parquet(inputPath).cache()
val joinedDf = firstDf.join(secondDf, Seq("ID"), "left_outer")
Spark shuffles for the join, so no broadcast join seems to happen.
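One way to confirm which join strategy is picked in each case is to print the physical plan with explain(). The following is only a minimal sketch: the object name JoinPlanCheck, the local master, and the tiny in-memory DataFrames are hypothetical stand-ins for the real firstDf and the Parquet input, so the plans it prints will not necessarily match those of the real job.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, not part of the original job.
object JoinPlanCheck {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough to inspect plans.
    val sparkSession = SparkSession.builder()
      .appName("join-plan-check")
      .master("local[*]")
      .getOrCreate()
    import sparkSession.implicits._

    // Assumption: small in-memory stand-ins for firstDf and the Parquet data.
    val firstDf = Seq((1, "a"), (2, "b"), (3, "c")).toDF("ID", "left_value")
    val secondDf = Seq((1, "x"), (2, "y")).toDF("ID", "right_value")

    // Join without caching, then print the physical plan.
    val plainJoin = firstDf.join(secondDf, Seq("ID"), "left_outer")
    plainJoin.explain()

    // Join against the cached DataFrame, then print the physical plan.
    val cachedJoin = firstDf.join(secondDf.cache(), Seq("ID"), "left_outer")
    cachedJoin.explain()

    sparkSession.stop()
  }
}
```

In the printed plans, a BroadcastHashJoin node indicates a broadcast join, while a SortMergeJoin node preceded by Exchange nodes indicates that the data is shuffled.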
My question is: why is this happening, and how can I avoid the shuffling when I cache one DataFrame?
Thanks a lot