I want to understand how Spark's RDD persistence helps with fault tolerance.
Let's say I have 3 nodes in my cluster, namely N1, N2 and N3, and I am running Spark map transformations as RDD1 -> RDD2 -> RDD3. I have persisted RDD2 (the first RDD3.count() succeeds, so the cache is materialised). Say RDD2 has 6 partitions, each node holds 2 of them, and on persistence those partitions are stored in each node's RAM (in-memory).
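Here is a minimal sketch of the pipeline I have in mind (the file path, partition count of 6 and the map functions are just placeholders for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("persistence-test"))

// RDD1: load the source file (path is a placeholder) into 6 partitions
val rdd1 = sc.textFile("hdfs:///data/input.txt", 6)

// RDD2: first map transformation; persisted in memory only
val rdd2 = rdd1.map(line => line.toUpperCase)
rdd2.persist(StorageLevel.MEMORY_ONLY)

// RDD3: second map transformation
val rdd3 = rdd2.map(line => line.length)

// first action: succeeds and materialises RDD2's 6 partitions
// in the executors' RAM (2 per node in my scenario)
rdd3.count()
```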
Now, the second time I call RDD3.count(), N3 goes down. How does Spark compute the RDD3 count in this case?
As per the docs: "Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it."
Now, when N3 fails, Spark will try to recreate the lost partitions of RDD3 from RDD2, since RDD3 = RDD2.map(...). But as per my understanding, if N3 fails, then all in-memory partitions of RDD2 on N3 are also lost (the 2 partitions of RDD2 that N3 was holding).
Even if Spark then tries to recreate RDD2 (from RDD1.map), it has to recompute from the beginning, because even if RDD1 had been persisted, RDD1's partitions on N3 are gone too. The same applies to every earlier RDD: when a node goes down, the cached data of any previous RDD on that node is lost as well. So does this always mean recomputing from the very beginning (re-loading the source file)?
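For reference, this is how I have been looking at what Spark keeps around for recovery. toDebugString prints the lineage (the chain of transformations back to the source), and after the first count() it also marks RDD2's cached partitions; I assume this lineage is what Spark would replay for the lost partitions, but I am not sure whether it replays it only for those partitions or for everything:

```scala
// Print the lineage Spark tracks for RDD3; after the first count(),
// the output also shows RDD2's CachedPartitions.
println(rdd3.toDebugString)

// Second action: if N3 has died in the meantime, my question is whether
// Spark recomputes only the 2 lost partitions from the source file,
// or the whole pipeline from the beginning.
rdd3.count()
```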
Please shed some light on this. Thanks.