
I want to understand how Spark's RDD persistence helps with fault tolerance.

Let's say I have 3 nodes in my cluster, N1, N2, and N3, and I am performing Spark transformations (maps) as Rdd1 -> Rdd2 -> Rdd3. I have persisted Rdd2 (a count on Rdd3 succeeds the first time). Say Rdd2 has 6 partitions, 2 on each node, and after persistence each node holds its 2 partitions in RAM (in-memory).
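The layout above (6 cached partitions, 2 per node) can be sketched with a toy placement map. This is an illustration only, not how Spark's scheduler actually assigns partitions; the node names and the round-robin placement are assumptions for the example:

```python
from collections import Counter

# Toy placement map (illustration only; Spark's scheduler decides real
# placement): 6 cached partitions of Rdd2 spread over 3 nodes.
nodes = ["N1", "N2", "N3"]
placement = {p: nodes[p % 3] for p in range(6)}

per_node = Counter(placement.values())
print(per_node)          # 2 partitions cached on each node

# If N3 goes down, the partitions placed on it vanish from the cache.
lost = [p for p, n in placement.items() if n == "N3"]
print(lost)              # the 2 partitions that must be recomputed
```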

Now, the second time Rdd3.count() is called, N3 goes down. How does Spark compute the Rdd3 count in this case?

As per doc: "Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it."

Now, as N3 fails, Spark will try to recreate Rdd3 from Rdd2, since Rdd3 = Rdd2.map(). But as per my understanding, if N3 fails then all the in-memory data of Rdd2 on N3 (its 2 partitions) is also lost.

Even if Spark tries to recreate Rdd2 (from Rdd1.map), it has to recompute from the beginning, because even if Rdd1 had been persisted, Rdd1's partitions on N3 would be lost too. The same applies to all earlier RDDs: as a node goes down, every trace of any previous RDD's data on that node is also lost. So is it always a recomputation from the beginning (loading the file)?

Please shed some light. Thanks.


1 Answer


I want to understand how Spark's RDD persistence helps with fault tolerance.

Spark's cache does not enhance its fault tolerance. Rather, Spark RDDs, both cached and uncached, are fault tolerant.

As a node goes down, every trace of any previous RDD's data on that node is also lost, so is it always a recomputation from the beginning (loading the file)?

Yes. Spark will go back as far in the RDD's lineage as necessary to recreate the lost data. Previously cached precursor RDDs could in theory be used to recreate a partition, but data is not shuffled between nodes unnecessarily, so the missing partition's data will not be sitting on the surviving nodes. Recomputing the lost partitions therefore almost certainly means going back to the beginning and reloading the original data.
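The lineage walk described above can be sketched with a toy model. This is not Spark's internals; the `ToyRDD` class, `load_partition` helper, and partition numbers are all hypothetical, invented just to show how a cache miss falls back through the lineage to the source:

```python
# Toy model of RDD lineage recovery (illustration only, not Spark code).
# Each "RDD" knows its parent and the function that produced it; a cached
# partition can be read back unless the node holding it was lost.

def load_partition(p):
    # stand-in for reading partition p of the base RDD from the source file
    return list(range(p * 10, p * 10 + 10))

class ToyRDD:
    def __init__(self, parent=None, fn=None):
        self.parent, self.fn = parent, fn
        self.cache = {}            # partition index -> cached data

    def compute(self, p):
        if p in self.cache:        # cache hit: no recomputation needed
            return self.cache[p]
        if self.parent is None:    # base RDD: reload from the source
            return load_partition(p)
        # cache miss: recurse up the lineage, then reapply the transformation
        return [self.fn(x) for x in self.parent.compute(p)]

rdd1 = ToyRDD()
rdd2 = ToyRDD(rdd1, lambda x: x * 2)
rdd3 = ToyRDD(rdd2, lambda x: x + 1)

# first count: populate rdd2's cache (simulating persist())
for p in range(6):
    rdd2.cache[p] = rdd2.compute(p)

# N3 goes down: its two cached partitions of rdd2 (say, 4 and 5) are lost
for p in (4, 5):
    del rdd2.cache[p]

# second count: partitions 0-3 reuse the cache, while 4 and 5 fall all the
# way back to the source, as described above
count = sum(len(rdd3.compute(p)) for p in range(6))
print(count)  # 60
```

The key point the model captures: only the lost partitions are recomputed, but for them the recomputation runs the full lineage from the source, since no other node holds a copy of their data.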