My understanding is that if I have a dataframe if I cache() it and trigger an action like df.take(1)
or df.count() it should compute the dataframe and save it in memory, And whenever that cached dataframe is called in the program it uses already computed dataframe from cache.
but that is not how my program is working.
I have a dataframe like below which I am caching it, and then immediately I run a df.count
action.
val df = inputDataFrame.select().where().withColumn("newcol" , "").cache()
df.count
When I run the program. In Spark UI I see that first line runs for 4 min and when it comes to second line it again runs 4 min basically first line is re computed twice?
Shouldn't first line computed and cached when second line triggers?
how to resolve this behavior. I am stuck, please advise.