
Here is my use case.

  1. I have multiple sources df1 to df4; df3 represents an existing Hive table.
  2. I build a df5 from df1 to df4.
  3. I insert/append df5 into that existing Hive table.
  4. I save df5 to another location.

The problem is that step 4 saves nothing to that location. Does that mean df3 changes after step 3? I already call cache() on df1 through df5, but it looks like df5 gets recomputed if its source has changed. I checked the Spark Web UI's Storage tab: all the DataFrames are 100% cached.

1 Answer


In general you shouldn't depend on this behavior in either direction. There is no mechanism in Spark that tracks changes in an arbitrary data source, so picking up changes is largely incidental and cannot be taken for granted.

At the same time, Spark can choose to recompute a cached DataFrame in many different scenarios.

In some cases Spark can also detect changes (typically when data is loaded from files) and throw an exception.