I ran the following job in the spark-shell:
val d = sc.parallelize(0 until 1000000).map(i => (i%100000, i)).persist
d.join(d.reduceByKey(_ + _)).collect
The Spark UI shows three stages. Stages 4 and 5 correspond to the computation of d, and stage 6 corresponds to the collect action. Since d is persisted, I would expect only two stages. However, stage 5 is present but is not connected to any other stage.
So I tried running the same computation without using persist, and the DAG looks identical, except without the green dots indicating that the RDD has been persisted. I would expect the output of stage 11 to be connected to the input of stage 12, but it is not.
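For completeness, this is the unpersisted variant I ran (the same pipeline with the persist call dropped; d2 is just a fresh name for the uncached RDD):

val d2 = sc.parallelize(0 until 1000000).map(i => (i % 100000, i))
d2.join(d2.reduceByKey(_ + _)).collect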
Looking at the stage descriptions, the stages do seem to indicate that d is being persisted, because stage 5 has input, but I am still confused as to why stage 5 even exists.
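For what it's worth, here is one way I can check that d is actually being served from the cache (a sketch; toDebugString prints the RDD lineage, and cached RDDs are flagged in its output once they have been materialized):

d.count() // force materialization so the cache is populated
println(d.toDebugString) // cached partitions should be reported in the lineage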