2 votes

I am learning Apache Spark and trying to clarify the concepts of caching and persistence of RDDs in Spark.

According to the section on persistence in the book "Learning Spark":

To avoid computing an RDD multiple times, we can ask Spark to persist the data. When we ask Spark to persist an RDD, the nodes that compute the RDD store their partitions. Spark has many levels of persistence to choose from based on what our goals are.

In Scala and Java, the default persist() will store the data in the JVM heap as unserialized objects. In Python, we always serialize the data that persist stores, so the default is instead stored in the JVM heap as pickled objects. When we write data out to disk or off-heap storage, that data is also always serialized.

But why does the default persist() store the data in the JVM heap as unserialized objects?
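To make the question concrete, here is a minimal sketch of what the default looks like in Scala. It assumes an existing SparkContext named sc and an illustrative file path "data.txt"; the commented-out lines show the equivalent explicit storage level and the serialized alternative.

```scala
import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("data.txt")

// Default persist() in Scala/Java: MEMORY_ONLY, i.e. unserialized
// objects kept on the JVM heap.
lines.persist()

// Equivalent explicit form:
// lines.persist(StorageLevel.MEMORY_ONLY)

// Serialized alternative: stores each partition as a compact byte array.
// lines.persist(StorageLevel.MEMORY_ONLY_SER)
```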


1 Answer

3 votes

Because there is no serialization and deserialization overhead, it is a low-cost operation, and cached data can be read back without additional work. SerDe is expensive and significantly increases the overall cost. And keeping both serialized and deserialized copies of the objects (particularly with standard Java serialization) can double memory usage in the worst case.
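A minimal sketch of the trade-off, assuming an existing SparkContext named sc; the RDD names are purely illustrative. MEMORY_ONLY keeps objects on the heap as-is (no SerDe cost on access, larger footprint), while MEMORY_ONLY_SER stores partitions as byte arrays (more compact, but every access pays a deserialization cost).

```scala
import org.apache.spark.storage.StorageLevel

val numbers = sc.parallelize(1 to 1000000)

// MEMORY_ONLY (the default): objects live on the heap unserialized,
// so reads hit them directly with no deserialization work.
val fast = numbers.map(_ * 2).persist(StorageLevel.MEMORY_ONLY)

// MEMORY_ONLY_SER: partitions are stored as serialized byte arrays,
// saving memory at the price of deserializing on each access.
val compact = numbers.map(_ * 2).persist(StorageLevel.MEMORY_ONLY_SER)

fast.count()     // first action materializes and caches the RDD
fast.count()     // served from the deserialized cache, no SerDe work

compact.count()  // cached as bytes; later actions must deserialize them
```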