4 votes

I use Databricks Community Edition.

My Spark program creates multiple jobs. Why? I thought there should be a single job, which could have multiple stages.

My understanding is that when a Spark program is submitted, it creates one job with multiple stages (usually a new stage per shuffle operation). Below is the code being used, where I have two possible shuffle operations (reduceByKey / sortByKey) and one action (take(5)).

rdd1 = sc.textFile('/databricks-datasets/flights')
rdd2 = (rdd1.flatMap(lambda x: x.split(",")).map(lambda x: (x, 1))
        .reduceByKey(lambda x, y: x + y, 8)
        .sortByKey(ascending=False).take(5))

[Screenshot: Spark job execution showing multiple jobs]

One more observation: the later jobs seem to get a new stage (some of them are skipped). What is causing the new job creation?
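One way to narrow this down is to run the pipeline one step at a time and check the Jobs tab in the Spark UI after each call. A sketch (variable names are just illustrative, reusing the same flights path):

rdd = sc.textFile('/databricks-datasets/flights')
pairs = rdd.flatMap(lambda x: x.split(",")).map(lambda x: (x, 1))
counts = pairs.reduceByKey(lambda x, y: x + y, 8)    # still lazy: no job expected yet
ordered = counts.sortByKey(ascending=False)          # at least one job should already show up here
top5 = ordered.take(5)                               # the action launches a further job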

Perhaps because it's pyspark that you see so many Spark jobs? – Jacek Laskowski

1 Answer

2 votes

Generally there will be a job for each action. sortByKey, however, is a special case: it is technically a transformation (so it should be lazily evaluated), but its implementation requires an eager job to be run, so you're seeing one job for the sortByKey plus one job for the take.
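A minimal sketch of that behaviour, assuming a plain PySpark shell or notebook (the data here is made up purely for illustration):

pairs = sc.parallelize([("b", 1), ("a", 2), ("c", 3)], 2)
ordered = pairs.sortByKey(ascending=False)   # no action called yet, but the Jobs tab should
                                             # already show a job: sortByKey samples the keys
                                             # to pick range-partition boundaries
ordered.take(3)                              # the action itself launches another job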

That accounts for two of the jobs; I can't see where the third is coming from.

(The skipped stages are where the results of a shuffle are automatically cached; this is an optimization that has been present since around Spark 1.3.)
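A rough way to see the skipped-stage effect for yourself (a sketch reusing the flights path from the question; exact stage counts may differ between Spark versions):

counts = sc.textFile('/databricks-datasets/flights') \
           .flatMap(lambda x: x.split(",")) \
           .map(lambda x: (x, 1)) \
           .reduceByKey(lambda x, y: x + y, 8)
counts.take(5)   # first run: the map-side stage executes and writes shuffle output
counts.take(5)   # second run: that stage is listed as "skipped" because the shuffle output is reused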

Further information on the sortByKey internals: Why does sortBy transformation trigger a Spark job?