3
votes

Hello stackoverflow community.

I'd like your help in understanding whether my reasoning is correct, or whether I'm missing something in my Spark job.

I currently have two RDDs that I want to subtract. Both are built from different transformations on the same parent RDD.

First of all, the parent RDD is cached after it is obtained:

val fatherRdd = grandFather.repartition(n).mapPartitions(mapping).cache

Then the two RDDs are derived from it. The first is (pseudocode):

val son1 = fatherRdd.filter(filtering_logic).map(take_only_key).distinct

The other one is:

val son2 = fatherRdd.filter(filtering_logic2).map(take_only_key).distinct

The two sons are then subtracted to obtain only the keys that appear in son1 but not in son2:

son1.subtract(son2)

I would expect the sequence of transformations to be the following:

  1. repartition
  2. mapPartitions
  3. caching

Then, starting from the cached data, the filter/map/distinct chain would run on both RDDs, followed by the subtract.

This is not what I observe. The two distinct operations run in parallel, apparently without exploiting the benefits of caching (there are no skipped tasks), and each takes almost the same computation time. Below is the image of the DAG taken from the Spark UI.

Do you have any suggestions for me?


1 Answer

3
votes

You are correct in your observations. Transformations on RDDs are lazy, so caching only takes effect after the first time the RDD is actually computed by an action.

If you call an action on your parent RDD first, it will be computed and cached; your subsequent jobs will then read from the cached data.
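A minimal sketch of the idea, reusing the names from your question (`grandFather`, `mapping`, and the filter/key functions are assumed to be defined as in your code; `count()` is used here purely to force materialization):

```scala
// Build and cache the parent RDD.
val fatherRdd = grandFather.repartition(n).mapPartitions(mapping).cache()

// Force an action so the partitions are actually computed and cached.
// Without this, the two jobs below may each recompute fatherRdd from scratch.
fatherRdd.count()

// Both derived RDDs now read fatherRdd's partitions from the cache,
// and their stages should show up as "skipped" in the Spark UI.
val son1 = fatherRdd.filter(filtering_logic).map(take_only_key).distinct()
val son2 = fatherRdd.filter(filtering_logic2).map(take_only_key).distinct()

val result = son1.subtract(son2) // keys in son1 but not in son2
```

The `count()` pays the cost of one extra pass over the data, but it guarantees the cache is populated before the two sons are computed.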