My understanding is that if I cache() a DataFrame and then trigger an action like df.take(1) or df.count(), Spark should compute the DataFrame and save it in memory, and whenever that cached DataFrame is referenced later in the program, it should use the already computed DataFrame from the cache.

But that is not how my program is working.

I have a DataFrame like the one below, which I am caching, and then I immediately run a df.count action.

  1. val df = inputDataFrame.select(...).where(...).withColumn("newcol", lit("")).cache()

  2. df.count

When I run the program, the Spark UI shows that the first line runs for 4 minutes, and when execution reaches the second line it runs for another 4 minutes. Is the first line basically being computed twice?

Shouldn't the first line be computed and cached when the second line triggers the action?

How do I resolve this behavior? I am stuck, please advise.


2 Answers


My understanding is that if I cache() a DataFrame and then trigger an action like df.take(1) or df.count(), Spark should compute the DataFrame and save it in memory,

That is not correct. A simple cache followed by count (take wouldn't work even for an RDD, since it computes only as many partitions as it needs) is a valid method for RDDs, but it is not for Datasets, which use much more advanced optimizations. With the query:

df.select(...).where(...).withColumn("newcol", lit("")).count()

any column that is not used in the where clause can be ignored, so the cached data may never be fully materialized (see the sketch below).
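
A minimal sketch of this effect (hypothetical data; assumes a spark-shell session where spark and its implicits are in scope):

import org.apache.spark.sql.functions.lit
import spark.implicits._

// Hypothetical stand-in for the question's pipeline.
val df = spark.range(1000000).toDF("id")
  .where($"id" % 2 === 0)
  .withColumn("newcol", lit(""))
  .cache()

// count() only needs the number of rows, so "newcol" (and any other
// column not referenced in the where clause) can be pruned away,
// and the cached data may never be fully materialized.
df.count()
df.explain()  // inspect the optimized plan to see what was pruned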

There is an important discussion about this on the developer list; quoting Sean Owen:

I think the right answer is "don't do that" but if you really had to you could trigger a Dataset operation that does nothing per partition. I presume that would be more reliable because the whole partition has to be computed to make it available in practice. Or, go so far as to loop over every element.

Translated to code:

df.foreach(_ => ())
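
Applied to the question's pipeline, a sketch (inputDataFrame and the column names are placeholders) would be:

import org.apache.spark.sql.functions.lit
import spark.implicits._

val df = inputDataFrame
  .select($"a", $"b")                // hypothetical columns
  .where($"a" > 0)
  .withColumn("newcol", lit(""))
  .cache()

df.foreach(_ => ())  // a no-op per element, but it forces every partition to be computed and cached
df.count()           // now answered from the cache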

There is also

df.registerTempTable("df")
sqlContext.sql("CACHE TABLE df")

which is eager, but this approach is no longer documented (Spark 2 and onward) and should be avoided.
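
For reference, the Spark 2+ spelling of the same eager SQL-level cache (a sketch; spark is the SparkSession) replaces the deprecated registerTempTable:

df.createOrReplaceTempView("df")
spark.sql("CACHE TABLE df")  // CACHE TABLE is eager by default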


No. If you call cache on a DataFrame, it is not cached at that moment; it is only "marked" for potential future caching. The actual caching happens only when an action follows later. You can also see your cached DataFrame in the Spark UI under "Storage".
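
A short sketch of that behavior (hypothetical data; run e.g. in spark-shell):

val df = spark.range(10).toDF("id").cache()  // only marked for caching; nothing is computed yet
// At this point nothing shows up under "Storage" in the Spark UI.
df.count()  // first action: computes the DataFrame and populates the cache
df.count()  // served from the in-memory cache; "Storage" now lists the DataFrame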

Another problem in your code is that count on a DataFrame does not compute the entire DataFrame, because not all columns need to be evaluated to count rows. You can use df.rdd.count() to force full evaluation (see How to force DataFrame evaluation in Spark).
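
A sketch of the difference (assuming the cached df from the question):

df.count()      // the optimizer may prune columns that are not needed for the row count
df.rdd.count()  // goes through the RDD representation, materializing every row fully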

The remaining question is why your first operation takes so long even though no action is called. This may be related to the caching logic (e.g. size estimation) that runs when cache is called (see e.g. Why is rdd.map(identity).cache slow when rdd items are big?).