8
votes

When we talk about RDD graphs, does it mean lineage graph or DAG (direct acyclic graph) or both? and when is the lineage graph generated? is it generated before the DAG of Spark tasks?

1

1 Answers

9
votes

An RDD can depend on zero or more other RDDs. For example when you say x = y.map(...), x will depend on y. These dependency relationships can be thought of as a graph.

You can call this graph a lineage graph, as it represents the derivation of each RDD. It is also necessarily a DAG, since a loop is impossible to be present in it.

Narrow dependencies, where a shuffle is not required (think map and filter) can be collapsed into a single stage. Stages are a unit of execution, and they are generated by the DAGScheduler from the graph of RDD dependencies. Stages also depend on each other. The DAGScheduler builds and uses this dependency graph (which is also necessarily a DAG) to schedule the stages.