Do part files in mapper output represent the input splits?

Question

A map-only job writes its output as part files named part-m-00000, part-m-00001, and so on. Do these files correspond to the first input split, the second input split, and so on, and are they generated sequentially?

Your question is a little unclear. The file names are sequential, so what do you mean by "are they generated sequentially"? — YoungHobbit
By sequentially I mean this: from what I have studied, whichever mapper executes first starts producing output first, but how are the files written? Does the first split correspond to the first mapper, the second split to the second mapper, and so on? Do the part files have any relation to the splits? — user3101883

Joydip Datta · Accepted Answer · 2015-12-17T07:22:29

Not necessarily. The splits returned by the getSplits() method are sorted by size, so that the biggest go first. This sorted array is passed further down, and a map task is created for each element. So the original ordering of the splits is lost once the sort is done, and part-m-00000 need not correspond to the first input split.

Reference: org.apache.hadoop.mapreduce.JobSubmitter class. See method writeSplits(..)

Link to source code: https://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/JobSubmitter.java
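
To illustrate the effect of that sort, here is a minimal Java sketch. It does not use any Hadoop classes: SketchSplit, SplitOrderSketch and taskToOriginalSplit are hypothetical names made up for this example, and the split lengths are invented. It only assumes, as described above, that the splits are sorted largest first and that map task numbers follow the sorted order.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical stand-in for an input split: we only track where it sat
// in the original list and how long it is.
final class SketchSplit {
    final int originalIndex;
    final long length;

    SketchSplit(int originalIndex, long length) {
        this.originalIndex = originalIndex;
        this.length = length;
    }
}

final class SplitOrderSketch {

    // For each task number (array position), returns the index the split
    // had before sorting. Assumption: tasks are numbered in sorted order.
    static int[] taskToOriginalSplit(long[] lengths) {
        SketchSplit[] splits = new SketchSplit[lengths.length];
        for (int i = 0; i < lengths.length; i++) {
            splits[i] = new SketchSplit(i, lengths[i]);
        }

        // Largest split first. Arrays.sort on objects is stable, so splits
        // of equal length keep their original relative order.
        Arrays.sort(splits,
                Comparator.comparingLong((SketchSplit s) -> s.length).reversed());

        int[] result = new int[splits.length];
        for (int task = 0; task < splits.length; task++) {
            result[task] = splits[task].originalIndex;
        }
        return result;
    }

    public static void main(String[] args) {
        long[] lengths = {64, 128, 32}; // invented sizes
        int[] order = taskToOriginalSplit(lengths);
        for (int task = 0; task < order.length; task++) {
            System.out.println(String.format("part-m-%05d", task)
                    + " <- original split " + order[task]);
        }
    }
}
```

With the lengths {64, 128, 32}, task 0 receives the split that was originally at index 1, so the mapping comes out as [1, 0, 2]. Because the sort is stable, splits of equal size keep their relative order, so when all splits are the same size the numbering may happen to match the original order, but that should not be relied on.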

Further reading on how the file names are decided:

Once the task id is determined, the name of the file is decided by the getDefaultWorkFile API available in org.apache.hadoop.mapreduce.lib.output.FileOutputFormat class. Here is the documentation:

getDefaultWorkFile

public Path getDefaultWorkFile(TaskAttemptContext context,
                               String extension)
                        throws IOException
Get the default path and filename for the output format.
Parameters:
context - the task context
extension - an extension to add to the filename
Returns:
a full path $output/_temporary/$taskid/part-[mr]-$id

This means that "part" is suffixed with the task type ('m' for maps, 'r' for reduces) and the task partition number (i.e. the task id). For example, the file generated for the first map task of the job is named 'part-m-00000'.
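
The naming scheme can be sketched in plain Java as follows. This is an illustration under assumptions, not the Hadoop implementation: PartFileNameSketch and partFileName are hypothetical names, and the five-digit zero padding is inferred from the 'part-m-00000' example.

```java
import java.text.NumberFormat;
import java.util.Locale;

final class PartFileNameSketch {

    // Builds a name of the form <base>-<m|r>-<partition><extension>.
    // Assumption: the partition number is zero-padded to at least 5 digits.
    static String partFileName(String baseName, boolean isMapTask,
                               int partition, String extension) {
        NumberFormat fmt = NumberFormat.getInstance(Locale.ROOT);
        fmt.setMinimumIntegerDigits(5); // 0 -> "00000"
        fmt.setGroupingUsed(false);     // no thousands separators

        char taskType = isMapTask ? 'm' : 'r';
        return baseName + "-" + taskType + "-" + fmt.format(partition) + extension;
    }

    public static void main(String[] args) {
        System.out.println(partFileName("part", true, 0, ""));
        System.out.println(partFileName("part", false, 12, ".gz"));
    }
}
```

Here the number in the name is simply the task's partition number, so it reflects the order in which tasks were numbered rather than when each task ran or finished.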

Javadoc reference: https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.html#getDefaultWorkFile(org.apache.hadoop.mapreduce.TaskAttemptContext, java.lang.String)

The older FileOutputFormat API in the org.apache.hadoop.mapred package works in a similar way. Here is the reference: https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapred/FileOutputFormat.html#getUniqueName(org.apache.hadoop.mapred.JobConf, java.lang.String)
