7 votes

At Spark Summit 2014, Aaron gave the talk A Deeper Understanding of Spark Internals. Slide 17 of his deck shows a stage being split into 4 tasks, as below:
[slide 17 image: a stage split into 4 tasks]

Here I want to know three things about how a stage is split into tasks:

  1. In the example above, it seems that the number of tasks is based on the number of files. Am I right?

  2. If I'm right in point 1, and there were just 3 files under the names directory, would it create just 3 tasks?

  3. If I'm right in point 2, what if there is just one very large file? Would this stage be split into just 1 task? And what if the data is coming from a streaming source?

Thanks a lot; I'm confused about how a stage is split into tasks.


4 Answers

1 vote

You can configure the number of partitions (splits) for the entire process as the second parameter to a job, e.g. for parallelize, if we want 3 partitions:

val a = sc.parallelize(myCollection, 3)

Spark will divide the work into relatively even sizes (*). Large files will be broken down accordingly; you can see the actual number of partitions with:

rdd.partitions.size

So no, you will not end up with a single worker chugging away for a long time on a single file.

(*) If you have very small files, that may change this behavior; but in any case, large files will follow this pattern.
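
Putting those lines together as a minimal, runnable sketch (assuming an existing SparkContext `sc`; the collection and the HDFS path are only illustrative):

// Explicit partition count, then verify it.
val myCollection = 1 to 100000
val a = sc.parallelize(myCollection, 3)
println(a.partitions.size)        // => 3

// For a file-based RDD the same check shows how a large file was split.
val rdd = sc.textFile("hdfs:///some/large/file")   // hypothetical path
println(rdd.partitions.size)      // one partition per HDFS block by default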

0
votes

The split occurs in two stages:

First, HDFS splits the logical file into 64 MB or 128 MB physical blocks when the file is loaded.

Second, Spark will schedule a map task to process each physical block. There is a fairly complex internal scheduling process, since there are (by default) three copies of each block stored on three different servers, and for large logical files it may not be possible to run all the tasks at once. The way this is handled is one of the major differences between Hadoop distributions.

When all the map tasks have run, the collect, shuffle, and reduce tasks can then be run.
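
As a rough sketch of that two-step split (the block size, file size, and path below are assumptions for illustration only):

// Assume a ~1 GB file stored in HDFS with 128 MB blocks => ~8 blocks.
val logs = sc.textFile("hdfs:///data/big.log")     // hypothetical path
println(logs.partitions.size)                      // ~8: one map task per block

// The shuffle/reduce side only runs after those map tasks complete.
val counts = logs.map(line => (line.take(1), 1)).reduceByKey(_ + _)
counts.collect()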

0
votes

Stage: a new stage is created when a wide transformation occurs.

Task: tasks are created based on the partitions on a worker.

See this link for more explanation: How DAG works under the covers in RDD?
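
A small sketch of that distinction (assuming an existing SparkContext `sc`): the narrow map stays in the same stage, while the wide reduceByKey introduces a shuffle and therefore a new stage.

// Narrow transformation: no shuffle, same stage.
val words = sc.parallelize(Seq("a", "b", "a", "c"), 2)
val pairs = words.map(w => (w, 1))

// Wide transformation: shuffle => new stage; tasks per stage = number of partitions.
val wordCounts = pairs.reduceByKey(_ + _)
println(wordCounts.toDebugString)   // the indentation marks the stage boundary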

0
votes

Question 1: In the example above, it seems that the number of tasks is based on the number of files. Am I right? Answer: It is not based on the number of files; it is based on your Hadoop blocks (0.gz, 1.gz are blocks of data stored in HDFS).

Question 2: If I'm right in point 1, and there were just 3 files under the names directory, would it create just 3 tasks? Answer: By default, the block size in Hadoop is 64 MB, and each block of data is treated as a partition in Spark. Note: number of partitions = number of tasks, which is why it would create 3 tasks.

Question 3: What if there is just one very large file? Would this stage be split into just 1 task? And what if the data is coming from a streaming source? Answer: No, the very large file will be partitioned, and as I answered for question 2, the number of tasks is created based on the number of partitions.
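
To make that concrete, a sketch (the path and sizes are hypothetical; assuming the file sits in HDFS with the usual block-based splitting):

// One very large file still becomes many partitions (one per block),
// so the stage is split into many tasks, not 1.
val big = sc.textFile("hdfs:///data/one_very_large_file.txt")
println(big.partitions.size)    // roughly file size / block size

// You can also ask Spark for a minimum number of splits (it is a hint):
val finer = sc.textFile("hdfs:///data/one_very_large_file.txt", 100)
println(finer.partitions.size)  // usually at least 100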