2 votes

I'm viewing my job in the Spark Application Master console, and I can see in real time the various stages completing as Spark eats its way through my application's DAG. It all goes reasonably fast. Some stages take less than a second, others take a minute or two.

[screenshot: stage list in the Spark Application Master console]

The final stage, at the top of the list, is rdd.saveAsTextFile(path, classOf[GzipCodec]). This stage takes a very long time.

I understand that transformations are executed zero, one or many times depending upon the execution plan created as a result of actions like saveAsTextFile or count.
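To illustrate what I mean, here is a minimal Scala sketch (the paths and names are placeholders, not my actual job):

    import org.apache.spark.{SparkConf, SparkContext}

    object LazyDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("lazy-demo"))

        // Transformations only record lineage in the DAG; no data is read yet.
        val lines  = sc.textFile("s3://my-bucket/input")   // placeholder path
        val parsed = lines.map(_.split('\t'))              // lazy
        val kept   = parsed.filter(_.nonEmpty)             // lazy

        // Each action triggers execution of the plan. Without cache(),
        // the transformations above may run once per action.
        val n = kept.count()                                               // action #1
        kept.map(_.mkString(",")).saveAsTextFile("s3://my-bucket/output")  // action #2

        sc.stop()
      }
    }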

As the job progresses, I can see the execution plan in the App Manager. Some stages are not present. Some are present more than once. This is expected. I can see the progress of each stage in real time (as long as I keep hitting F5 to refresh the page). The execution time is roughly commensurate with the input data size for each stage. Because of this, I'm certain that what the App Manager is showing me is the progress of the actual transformations, and not some meta-activity on the DAG.

So if the transformations are occurring in each of those stages, why is the final stage - a simple write to S3 from EMR - so slow?

If, as my colleague suggests, the transformation stages shown in the App Manager are not doing actual computation, what are they doing that consumes so much memory, CPU and time?


2 Answers

4 votes

In Spark, lazy evaluation is a key concept, and one you had better get familiar with if you want to work with Spark.

The stages you see completing so fast are not doing any significant computation.

If they are not doing actual computation, what are they doing?

They are updating the DAG.

When an action is triggered, Spark has the chance to consult the DAG and optimize the computation (something that wouldn't be possible without lazy evaluation).
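To make that concrete, a small sketch (assuming an existing SparkContext sc; the names are illustrative):

    // map and filter are only recorded in the DAG; when collect() runs,
    // Spark pipelines them into a single stage and a single pass over the data.
    val nums    = sc.parallelize(1 to 1000000)
    val doubled = nums.map(_ * 2)          // lazy: no pass over the data yet
    val small   = doubled.filter(_ < 100)  // lazy: fused with the map above
    val result  = small.collect()          // action: one pipelined pass happens here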

For more, read Spark Transformation - Why its lazy and what is the advantage?

Moreover, I think your colleague rushed to give you an answer, and mistakenly said:

transformation are cheap

The truth lies in the Spark documentation on RDD operations:

All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program.

Cheap is not the right word.

That explains why, at the end of the day, your final stage (the one that actually asks for data and triggers the action) is so slow in comparison with the other stages.

I mean, none of the stages you mention seems to trigger any action. As a result, the final stage has to take all of the prior stages into account and do all the work needed, but remember, from an optimized, Spark-level viewpoint.

3 votes

I guess the real confusion is here:

transformation are cheap

Transformations are lazy (most of the time), but nowhere near cheap. Lazy means a transformation won't be applied unless an eager descendant (an action) depends on it; it tells you nothing about its cost.

In general, transformations are where the real work happens. Output actions, excluding storage / network IO, are usually cheap compared to the logic executed in the transformations.
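A small sketch to illustrate (sc is an existing SparkContext; expensiveScore is a made-up stand-in for real per-record work):

    // Lazy, but not cheap: the map returns immediately because nothing runs yet,
    // but the per-record cost is only deferred, not avoided.
    def expensiveScore(x: Int): Double =
      (1 to 10000).foldLeft(0.0)((acc, i) => acc + math.sin(x.toDouble * i))

    val records = sc.parallelize(1 to 1000000)
    val scored  = records.map(expensiveScore)      // instant: only lineage is recorded
    scored.saveAsTextFile("s3://my-bucket/scores") // all the CPU cost lands in this action's stages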