I have an RDD formed by reading a local text file roughly 117MB in size:
scala> rdd
res87: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at textFile at <console>:24
I cache the RDD:
scala> rdd.persist()
res84: rdd.type = MapPartitionsRDD[3] at textFile at <console>:24
After this I call the 'take(1)' action on the RDD to force evaluation. Once this completes, I check the Spark UI's Storage page: it shows the fraction cached as only 2%, with 6.5MB in memory.

Then I call the 'count' action on the RDD. When I check the Storage page again, the numbers have changed: the fraction cached is now 82% and the size in memory is 258.2MB.

Does this mean that even after caching an RDD, Spark only actually caches the partitions that the subsequent action needs (since 'take(1)' only reads the first element)? And when the second action, 'count', was triggered, it had to touch every element, so it ended up caching the remaining partitions as well? I have not come across any documented behavior like this. Is it a bug?
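For reference, here is a minimal sketch of the steps I am running in spark-shell. The file path is a placeholder, and instead of the Storage page I read the cached size programmatically via 'sc.getRDDStorageInfo' (a developer API), which I assume reports the same numbers the UI shows:

    val rdd = sc.textFile("/tmp/data.txt")  // placeholder path; actual file is ~117MB
    rdd.persist()                           // no argument, so default StorageLevel.MEMORY_ONLY

    rdd.take(1)                             // runs a job over just the first partition (more only if it is empty)
    sc.getRDDStorageInfo.foreach { info =>  // DeveloperApi: per-RDD storage status
      println(s"${info.numCachedPartitions}/${info.numPartitions} partitions cached, ${info.memSize} bytes in memory")
    }

    rdd.count()                             // touches every partition
    sc.getRDDStorageInfo.foreach { info =>
      println(s"${info.numCachedPartitions}/${info.numPartitions} partitions cached, ${info.memSize} bytes in memory")
    }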