
Spark clears the cached RDD when a write action is invoked on a DataFrame derived from that cache after some transformations. Any further action that could have used the cache therefore has to recompute the RDD. However, if the write is replaced by some other action such as count or take, the cache persists and can be used in subsequent operations.

Why does it happen?
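A minimal sketch of the scenario being described; the session setup, paths, and column names are hypothetical placeholders, not the original code:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical session and paths, only to illustrate the reported behaviour.
val spark = SparkSession.builder().appName("cache-after-write").getOrCreate()
import spark.implicits._

val df = spark.read.parquet("/data/source")      // placeholder input
  .filter($"status" === "OK")
  .cache()

// The write is the first action invoked on the cached DataFrame.
df.write.mode("overwrite").parquet("/tmp/out")

// The cache appears to have been dropped, so this recomputes from the source.
println(df.count())
```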

Did you manage to resolve this issue eventually? - Assaf Neufeld
I did not. But as a workaround, I started writing the dataframe to a CSV instead of caching it, and then reading the CSV back, which replicates the cache behavior (see the sketch after these comments). The performance is not as good as caching, and it also creates CSV files that have to be cleaned up later. The issue seems to be in the InsertIntoHiveTable.scala code (github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/…), which triggers an un-cache. I think I can raise a Spark bug for this now. - Bay Max
Yeah we’re getting the same behavior when we do cache -> save parquet -> suddenly the data frame is empty.. - Assaf Neufeld
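A rough sketch of the CSV round-trip workaround mentioned in the comment above; it reuses the `spark` and `df` names from the earlier sketch, and the scratch path is hypothetical:

```scala
// Instead of relying on df.cache(), persist the intermediate result to disk
// and read it back, so downstream work runs off the materialized CSV rather
// than recomputing the lineage.
val tmpPath = "/tmp/df_checkpoint_csv"            // hypothetical scratch location

df.write.mode("overwrite").option("header", "true").csv(tmpPath)

val dfFromDisk = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(tmpPath)

// dfFromDisk can now be written out and reused without recomputation,
// at the cost of the extra CSV files that need to be cleaned up later.
```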

1 Answer


You can first call an action such as df.count() after you cache the DataFrame; the DataFrame will then actually be cached. Call write() only after the cache has been triggered by another action.
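A minimal sketch of this ordering, under the same hypothetical names and paths as the question's sketch:

```scala
val df = spark.read.parquet("/data/source")      // placeholder input
  .filter($"status" === "OK")
  .cache()

// An action other than write() materializes the cache first.
df.count()

// The write then runs against the already-cached data, and, per this answer,
// subsequent actions can still reuse the cache instead of recomputing.
df.write.mode("overwrite").parquet("/tmp/out")
```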