1
votes

I have a HadoopRDD from which I'm creating a first RDD with a simple map function, then a second RDD from the first RDD with another simple map function. Something like:

HadoopRDD -> RDD1 -> RDD2.

My question is whether Spark will iterate over the HadoopRDD record by record to generate RDD1 and then iterate over RDD1 record by record to generate RDD2, or whether it iterates over the HadoopRDD and generates RDD1 and then RDD2 in one go.
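
For context, a minimal sketch of the setup described above (the input path and the two map functions are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("pipelining").setMaster("local[*]"))

// textFile is backed by a HadoopRDD under the hood
val hadoopRdd = sc.textFile("hdfs:///some/input/path")  // placeholder path

val rdd1 = hadoopRdd.map(_.trim)    // first simple map
val rdd2 = rdd1.map(_.length)       // second simple map

rdd2.count()  // nothing runs until an action such as this is called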

2 Answers

3
votes

Short answer: rdd.map(f).map(g) will be executed in one pass.

tl;dr

Spark splits a job into stages. A stage applied to a partition of data is a task.

In a stage, Spark will try to pipeline as many operations as possible. "Possible" is determined by the need to rearrange data: an operation that requires a shuffle will typically break the pipeline and create a new stage.

In practical terms:

Given `rdd.map(...).map(...).filter(...).sortBy(...).map(...)`, Spark will produce two stages:

.map(...).map(...).filter(...)
.sortBy(...).map(...)

The stage layout can be retrieved from an RDD using rdd.toDebugString. The same example job above will produce this output:

val mapped = rdd.map(identity).map(identity).filter(_>0).sortBy(x=>x).map(identity)

scala> mapped.toDebugString
res0: String = 
(6) MappedRDD[9] at map at <console>:14 []
 |  MappedRDD[8] at sortBy at <console>:14 []
 |  ShuffledRDD[7] at sortBy at <console>:14 []
 +-(8) MappedRDD[4] at sortBy at <console>:14 []
    |  FilteredRDD[3] at filter at <console>:14 []
    |  MappedRDD[2] at map at <console>:14 []
    |  MappedRDD[1] at map at <console>:14 []
    |  ParallelCollectionRDD[0] at parallelize at <console>:12 []

Now, coming to the key point of your question: pipelining is very efficient. The complete pipeline will be applied to each element of each partition exactly once. This means that rdd.map(f).map(g) will perform as fast as rdd.map(f andThen g) (with some negligible overhead).
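
A quick way to see that equivalence in practice (f and g below are arbitrary placeholder functions):

// Placeholder functions; any two composable functions behave the same way.
val f: Int => Int = _ + 1
val g: Int => Int = _ * 2

val nums = sc.parallelize(1 to 1000000)

// Both expressions stay in a single stage; each element flows through
// f and then g exactly once per partition.
val chained  = nums.map(f).map(g)
val composed = nums.map(f andThen g)

// chained.toDebugString and composed.toDebugString show the same single-stage
// lineage, modulo one extra MappedRDD node for the chained version.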

0
votes

Apache Spark will iterate over the HadoopRDD record by record in no specific order (the data is split and sent to the workers) and "apply" the first transformation to compute RDD1. After that, the second transformation is applied to each element of RDD1 to get RDD2, again in no specific order, and so on for successive transformations. You can see this from the map method signature:

// Return a new RDD by applying a function to all elements of this RDD.
def map[U](f: (T) ⇒ U)(implicit arg0: ClassTag[U]): RDD[U]

Apache Spark uses a DAG (Directed Acyclic Graph) execution engine. It won't actually trigger any computation until a value is required, so you have to distinguish between transformations and actions.
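
A small sketch of that distinction, using placeholder data:

val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Transformations: only recorded in the DAG, no computation happens yet.
val doubled  = numbers.map(_ * 2)
val filtered = doubled.filter(_ > 4)

// Action: only here does Spark build stages and actually run the job.
val result = filtered.collect()  // Array(6, 8, 10)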

EDIT:

In terms of performance, I am not completely aware of the underlying implementation of Spark, but I understand there shouldn't be a significant performance loss other than adding extra (unnecessary) operations in the related stage. In my experience, you don't normally use transformations of the same "nature" successively (in this case two successive maps). You should be more concerned about performance when shuffle operations take place, because you are moving data around, and that has a clear impact on your job's performance. Here you can find a common issue regarding that.
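
For example, a chain of maps stays within one stage, while a key-based aggregation forces a shuffle and a new stage (a rough sketch with placeholder data):

val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))

val counts = words
  .map(w => (w, 1))     // narrow transformation: pipelined within the same stage
  .reduceByKey(_ + _)   // wide transformation: forces a shuffle and a new stage

println(counts.toDebugString)  // the ShuffledRDD marks the stage boundary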