I'm viewing my job in the Spark Application Master console and I can see, in real time, the various stages completing as Spark eats its way through my application's DAG. It all goes reasonably fast. Some stages take less than a second, others take a minute or two.
The final stage, at the top of the list, is rdd.saveAsTextFile(path, classOf[GzipCodec]). This stage takes a very long time.
I understand that transformations are executed zero, one, or many times, depending on the execution plan created as a result of actions like saveAsTextFile or count.
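To make the question concrete, here is a minimal sketch of the shape of the job; the paths, names, and transformations are placeholders, not my actual code:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.hadoop.io.compress.GzipCodec

    object LazyEvalSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("lazy-eval-sketch"))

        // Transformations: these only describe the DAG; nothing runs yet.
        val lines   = sc.textFile("s3://my-bucket/input/")   // placeholder path
        val parsed  = lines.map(_.split('\t'))                // placeholder logic
        val cleaned = parsed.filter(_.nonEmpty)

        // Actions: each one triggers execution of the plan leading up to it.
        val n = cleaned.count()                               // runs the DAG once
        println(s"records: $n")

        // A second action re-runs the (uncached) upstream transformations,
        // then writes gzip-compressed part files to S3.
        cleaned.map(_.mkString(",")).saveAsTextFile(
          "s3://my-bucket/output/", classOf[GzipCodec])       // placeholder path

        sc.stop()
      }
    }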
As the job progresses, I can see the execution plan in the App Manager. Some stages are not present. Some are present more than once. This is expected. I can see the progress of each stage in real time (as long as I keep hitting F5 to refresh the page). The execution time of each stage is roughly commensurate with its input data size. Because of this, I'm certain that what the App Manager is showing me is the progress of the actual transformations, and not some meta-activity on the DAG.
So if the transformations are occurring in each of those stages, why is the final stage - a simple write to S3 from EMR - so slow?
If, as my colleague suggests, the transformation stages shown in the App Manager are not doing actual computation, what are they doing that consumes so much memory, CPU, and time?