hadoop get actual number of mappers

Question

In the map phase of my program, I need to know the total number of mappers that are created. This will help me in the key creation process of the map (I want to emit as many key-value pairs for each object as the number of mappers).

I know that setting the number of mappers is just a hint, but how can I get the actual number of mappers? I tried the following in the configure method of my Mapper:

public void configure(JobConf conf) {
    System.out.println("map tasks: "+conf.get("mapred.map.tasks"));
    System.out.println("tipid: "+conf.get("mapred.tip.id"));
    System.out.println("taskpartition: "+conf.get("mapred.task.partition"));
}

But I get the results:

map tasks: 1
tipid: task_local1204340194_0001_m_000000
taskpartition: 0
map tasks: 1
tipid: task_local1204340194_0001_m_000001
taskpartition: 1

which means (?) that there are two map tasks, and not just one, as printed (which is quite natural, since I have two small input files). Shouldn't the number after map tasks be 2?

For now, I just count the number of files in the input folder, but this is not a good solution, since a file could be larger than the block size and result in more than one input split, and hence more than one mapper. Any suggestions?

wiki.apache.org/hadoop/HowManyMapsAndReduces It depends on your block size and your number of files. So you could actually calculate it outside MapReduce if you wanted to and then add this number to the DistributedCache of your job. — DDW
possible duplicate of Hadoop MapReduce: default number of mappers — harpun
Thank you @irW for the comment! I have something like that already, but I was wondering if there is something like a standard getter, instead of re-implementing a method that already exists and is already called. I will continue with this solution, if there is nothing better, though. — vefthym
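The calculation suggested in the comments above could look roughly like the sketch below. It is plain Java with no Hadoop dependency and only approximates what FileInputFormat does: it assumes the split size equals the block size, that every file is splittable, that an empty file still produces one split, and that the last split of a file may be up to 10% larger than the split size. The class and method names (SplitEstimate, splitsForFile, estimateSplits) are made up for illustration, and the file sizes are invented.

```java
public class SplitEstimate {
    // Assumption: same slack factor FileInputFormat applies to the last split of a file.
    private static final double SPLIT_SLOP = 1.1;

    // Estimate how many input splits a single file of the given size produces.
    static int splitsForFile(long fileSize, long splitSize) {
        if (fileSize == 0) {
            return 1; // assumption: an empty file still yields one (empty) split
        }
        int splits = 0;
        long remaining = fileSize;
        // Cut full-size splits while more than 1.1 split sizes remain.
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            remaining -= splitSize;
            splits++;
        }
        // Whatever is left becomes the final split.
        if (remaining > 0) {
            splits++;
        }
        return splits;
    }

    // Sum the per-file estimates; with one mapper per split this approximates the mapper count.
    static int estimateSplits(long[] fileSizes, long splitSize) {
        int total = 0;
        for (long size : fileSizes) {
            total += splitsForFile(size, splitSize);
        }
        return total;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;                   // hypothetical 64 MB block size
        long[] fileSizes = {1024, 2048, 200L * 1024 * 1024};  // hypothetical input file sizes
        System.out.println("estimated mappers: " + estimateSplits(fileSizes, blockSize));
    }
}
```

The estimate can be computed in the driver and passed to the mappers through the job configuration under a custom key. It will be off if the job changes the minimum split size, uses a non-splittable compression codec, or uses an input format that combines files.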

vefthym · Accepted Answer · 2013-11-12T15:00:02

Finally, it seems that conf.get("mapred.map.tasks") DOES work after all when I build an executable jar file and run my program on the cluster or locally. Now the output of "map tasks" is correct.

It failed only when I ran my MapReduce program locally on Hadoop from the Eclipse plugin. Maybe it is an issue with the Eclipse plugin.

I hope this will help someone else having the same issue. Thank you for your answers!
