
I am a little confused on the number of mappers spawned by a MapReduce Job.

I have read in a lot of places that the number of mappers does not depend on the number of blocks but on the number of splits, i.e. the number of maps is determined by the InputFormat: mappers = (total data size) / (input split size)

Example: data size is 1 TB and input split size is 128 MB.

Num mappers = (1 TB in MB) / 128 MB = (1 * 1024 * 1024) / 128 = 8192
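To sanity-check that arithmetic, here is a small sketch; I am assuming the split size is computed the way Hadoop 2.x's FileInputFormat.computeSplitSize does it, i.e. max(minSize, min(maxSize, blockSize)):

public class SplitMath {
    // Mirrors FileInputFormat.computeSplitSize in Hadoop 2.x (an assumption here):
    // split size = max(minSize, min(maxSize, blockSize)).
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long totalBytes = 1024L * 1024 * 1024 * 1024;                       // 1 TB of input
        long splitSize = computeSplitSize(128L << 20, 1L, Long.MAX_VALUE);  // 128 MB block
        long mappers = (totalBytes + splitSize - 1) / splitSize;            // ceiling division
        System.out.println(mappers);                                        // prints 8192
    }
}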

The above seems right if my input format is FileInputFormat.

But what if my input format is TextInputFormat?

Suppose I have a file of size 1 GB, with the default block size of 128 MB (in Hadoop 2.x); the number of blocks will be 8.

The file is a text file with each line occupying 1 MB.

  • Total number of lines: 1024

  • Total number of lines in each block: 128

Now, when I set the input format to TextInputFormat, how many mappers will Hadoop spawn?

Will it be 1024 (one for each line) or 8 (one for each block)?
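For reference, here is how one could check empirically (a sketch against the Hadoop 2.x mapreduce API; args[0] is a placeholder input path). It asks TextInputFormat for its splits, which is exactly what drives the mapper count:

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class CountSplits {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        // getSplits is what the framework calls to decide how many map tasks to run.
        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        System.out.println("splits (= mappers): " + splits.size());
    }
}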

Comment: or maybe just a single mapper, because block size is an HDFS feature; not sure if TextInputFormat uses it. – AdamSkywalker

1 Answer


You are confusing the issue.

Take this typical example in horrible Java MapReduce:

// Old (org.apache.hadoop.mapred) API: FileInputFormat sets *where* to read from,
FileInputFormat.setInputPaths(job, new Path(baseDir, CONTROL_DIR_NAME));
// and setInputFormat sets *how* those files are parsed into records.
job.setInputFormat(SequenceFileInputFormat.class);

Put simply:

  • FileInputFormat specifies the input directory where the data files are located. FileInputFormat reads all of those files and divides them into one or more InputSplits. So your assertion is correct.

  • TextInputFormat is the default InputFormat for MapReduce. There are others, such as SequenceFileInputFormat. Input splitting is always applied and is orthogonal to the choice of TextInputFormat.

The former is necessary; the latter is optional, since there is a default for how records are processed in MR. In your example the split size (128 MB, equal to the block size) still determines the mapper count, so you get 8 mappers, not 1024: TextInputFormat only controls how each split is decomposed into line records.
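For comparison, here is a minimal sketch of the same two calls against the newer org.apache.hadoop.mapreduce API (the job name and input path are placeholders, and the mapper/reducer/output wiring is omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class Driver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-demo");
        // Necessary: where to read from. These files are divided into InputSplits.
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        // Optional: TextInputFormat is already the default. It turns each split
        // into (byte offset, line of text) records, one record per line.
        job.setInputFormatClass(TextInputFormat.class);
    }
}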