I am learning Apache Spark and trying to clear up the concepts related to caching and persistence of RDDs in Spark.
So, according to the section on persistence in the book "Learning Spark":
To avoid computing an RDD multiple times, we can ask Spark to persist the data. When we ask Spark to persist an RDD, the nodes that compute the RDD store their partitions. Spark has many levels of persistence to choose from based on what our goals are.
In Scala and Java, the default persist() will store the data in the JVM heap as unserialized objects. In Python, we always serialize the data that persist stores, so the default is instead stored in the JVM heap as pickled objects. When we write data out to disk or off-heap storage, that data is also always serialized.
But why does the default persist() store the data in the JVM heap as unserialized objects?
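For reference, here is a minimal sketch of the kind of call I mean (the app name, master URL, and RDD contents are just placeholders I made up, based on my understanding of the book's description):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("persist-example").setMaster("local[*]"))

    val numbers = sc.parallelize(1 to 1000000)
    val squares = numbers.map(x => x.toLong * x)

    // Default persist() in Scala/Java: StorageLevel.MEMORY_ONLY,
    // i.e. the partitions are kept as deserialized objects on the JVM heap.
    squares.persist()

    // Explicitly requesting a serialized in-memory representation instead
    // (an RDD can only have one storage level, so this would replace the line above):
    // squares.persist(StorageLevel.MEMORY_ONLY_SER)

    // Two actions on the same RDD: without persist(), the map above
    // would be recomputed for each of them.
    println(squares.count())
    println(squares.reduce(_ + _))

    sc.stop()
  }
}
```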