Do the part files generated as the output of a mapper-only job (part-m-00000, part-m-00001, and so on) correspond to the first input split, the second input split, and so on? And are they generated sequentially?
1 Answer
Not necessarily. The split array returned by the getSplits() method is sorted by size, so that the biggest splits go first. This sorted array is passed further down, and a map task is created for each element. So the original ordering information is lost when the sort happens.
Reference: the org.apache.hadoop.mapreduce.JobSubmitter class. See the method writeSplits(..).
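To make the effect of that sort concrete, here is a minimal sketch in plain Java with no Hadoop dependency. The Split class, the taskToOriginalIndex helper, and the sample lengths are hypothetical names and values introduced only for illustration. The sketch assumes, as described above, that splits are sorted largest first and that the map task at position i in the sorted array writes the i-th part file.

```java
import java.util.Arrays;
import java.util.Comparator;

public class SplitOrderSketch {
    // Hypothetical stand-in for an input split: its original position and its length in bytes.
    static final class Split {
        final int originalIndex;
        final long length;

        Split(int originalIndex, long length) {
            this.originalIndex = originalIndex;
            this.length = length;
        }
    }

    // For each task position after sorting, returns the original index of the split it receives.
    static int[] taskToOriginalIndex(long[] lengths) {
        Split[] splits = new Split[lengths.length];
        for (int i = 0; i < lengths.length; i++) {
            splits[i] = new Split(i, lengths[i]);
        }
        // Sort by length, biggest first (the ordering the answer attributes to writeSplits).
        Arrays.sort(splits, Comparator.comparingLong((Split s) -> s.length).reversed());

        int[] order = new int[splits.length];
        for (int i = 0; i < splits.length; i++) {
            order[i] = splits[i].originalIndex;
        }
        return order;
    }

    public static void main(String[] args) {
        long[] lengths = {64, 128, 32}; // made-up split sizes
        int[] order = taskToOriginalIndex(lengths);
        for (int task = 0; task < order.length; task++) {
            System.out.printf("part-m-%05d <- original split %d%n", task, order[task]);
        }
    }
}
```

With these made-up sizes, the second split is the largest, so it would end up in part-m-00000 rather than part-m-00001.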
Further reading on how the file names are decided:
Once the task id is determined, the name of the file is decided by the getDefaultWorkFile API available in the org.apache.hadoop.mapreduce.lib.output.FileOutputFormat class. Here is the documentation:
getDefaultWorkFile

public Path getDefaultWorkFile(TaskAttemptContext context, String extension) throws IOException

Get the default path and filename for the output format.

Parameters:
context - the task context
extension - an extension to add to the filename
Returns:
a full path $output/_temporary/$taskid/part-[mr]-$id
This means "part" is followed by the task type ('m' for maps, 'r' for reduces) and the task partition number (i.e. the task id). For example, the output file of the first map task of a job is named 'part-m-00000'.
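The naming pattern described above can be sketched as follows. This is not the Hadoop implementation, only an illustration of the part-[mr]-$id pattern in plain Java. The partFileName helper and its parameters are hypothetical, and the five-digit zero padding is inferred from the 'part-m-00000' example.

```java
import java.util.Locale;

public class PartFileNameSketch {
    // Hypothetical helper: builds a name of the form part-[mr]-NNNNN plus an optional extension.
    static String partFileName(boolean isMap, int partition, String extension) {
        char type = isMap ? 'm' : 'r'; // 'm' for map tasks, 'r' for reduce tasks
        // Locale.ROOT keeps the digits ASCII regardless of the JVM's default locale.
        return String.format(Locale.ROOT, "part-%c-%05d%s", type, partition, extension);
    }

    public static void main(String[] args) {
        System.out.println(partFileName(true, 0, ""));      // first map task
        System.out.println(partFileName(false, 12, ".gz")); // a reduce task with an extension
    }
}
```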
The older FileOutputFormat API in the org.apache.hadoop.mapred package also works in a similar way. Here is the reference: https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapred/FileOutputFormat.html#getUniqueName(org.apache.hadoop.mapred.JobConf, java.lang.String)
Are they generated sequentially? – YoungHobbit