3 votes

RDDs that have been cached using the rdd.cache() method from the Scala shell are stored in memory.

That means they consume part of the RAM available to the Spark process itself.

Having said that, if RAM is limited and more and more RDDs are cached, when will Spark automatically clean the memory occupied by the RDD cache?
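For reference, this is roughly what I am doing from the spark-shell (the input path is just a placeholder):

```scala
// spark-shell; `sc` is the SparkContext the shell provides
val rdd = sc.textFile("data.txt")  // "data.txt" is a placeholder input
rdd.cache()                        // mark the RDD for storage in executor memory
rdd.count()                        // the first action materialises the cached partitions
```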

If you want to uncache your RDD, you can try .unpersist(): see stackoverflow.com/questions/25938567/how-to-uncache-rdd (Zouzias)
No, I want to know when Spark will do it automatically. (KayV)
The ContextCleaner is responsible for doing this at regular intervals: github.com/apache/spark/blob/master/core/src/main/scala/org/… (Zouzias)
The cache is cleaned in a least-recently-used (LRU) fashion. Also, the memory allocated for caching is separate from the memory used for computation. (philantrovert)

2 Answers

3 votes

Spark will clean cached RDDs and Datasets / DataFrames:

  • When it is explicitly asked to, by calling the RDD.unpersist (How to uncache RDD?) / Dataset.unpersist methods, or Catalog.clearCache (a short sketch follows this list).
  • At regular intervals, by the cache cleaner:

    Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.

  • When the corresponding distributed data structure is garbage collected.
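As a minimal spark-shell sketch of the explicit route (the data and names are purely illustrative; `sc` and `spark` are the objects the shell provides):

```scala
val rdd = sc.parallelize(1 to 1000000).cache()  // mark the RDD for caching
rdd.count()                                     // an action actually fills the cache

val df = spark.range(1000000).cache()           // same idea for a Dataset / DataFrame
df.count()

rdd.unpersist()            // explicitly drop the cached RDD blocks
df.unpersist()             // explicitly drop the cached Dataset blocks
spark.catalog.clearCache() // drop everything cached through the SQL catalog
```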

3 votes

Spark will automatically unpersist/clean the RDD or DataFrame if it is no longer used. To check whether an RDD is cached, open the Spark UI, go to the Storage tab, and look at the memory details.

From the shell, we can use rdd.unpersist() or sqlContext.uncacheTable("sparktable") to remove the RDD or tables from memory.

Spark is built around lazy evaluation: unless and until you call an action, it does not load or process any data into the RDD or DataFrame.
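As an illustrative spark-shell sketch (the DataFrame and the table name "sparktable" are placeholders; in a 2.x shell the older SQLContext is reachable as spark.sqlContext):

```scala
val df = spark.range(100000).toDF("id")
df.createOrReplaceTempView("sparktable")

spark.catalog.cacheTable("sparktable")         // mark the table for caching
spark.table("sparktable").count()              // lazy evaluation: only this action fills the cache
println(spark.catalog.isCached("sparktable"))  // true -> visible under the Storage tab in the Spark UI

spark.sqlContext.uncacheTable("sparktable")    // the SQLContext call mentioned above
println(spark.catalog.isCached("sparktable"))  // false again
```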