3
votes

In the map phase of my program, I need to know the total number of mappers that are created. This will help me with key creation in the map phase (I want to emit as many key-value pairs for each object as there are mappers).

I know that setting the number of mappers is just a hint, but is there a way to get the actual number of mappers? I tried the following in the configure method of my Mapper:

public void configure(JobConf conf) {
    // total number of map tasks, as stored in the job configuration
    System.out.println("map tasks: " + conf.get("mapred.map.tasks"));
    // task-in-progress id of this particular map task
    System.out.println("tipid: " + conf.get("mapred.tip.id"));
    // index of this task within the job
    System.out.println("taskpartition: " + conf.get("mapred.task.partition"));
}

But I get the results:

map tasks: 1
tipid: task_local1204340194_0001_m_000000
taskpartition: 0
map tasks: 1
tipid: task_local1204340194_0001_m_000001
taskpartition: 1

which means (?) that there are two map tasks, not just one as printed (which is quite natural, since I have two small input files). Shouldn't the number after "map tasks" be 2?

For now, I just count the number of files in the input folder, but this is not a good solution, since a file could be larger than the block size and result in more than one input split, and hence more than one mapper. Any suggestions?
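
In case it is useful, this is roughly the pattern I have in mind, as a sketch only: the mapper count is assumed to be available under a made-up property ("my.num.mappers") set by the driver, and the class name and key format are purely illustrative.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical mapper: emits one copy of each record per mapper,
// keyed by the target mapper index.
public class ReplicatingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private int numMappers = 1;

    @Override
    public void configure(JobConf conf) {
        // "my.num.mappers" is a custom property set by the driver;
        // it is not a standard Hadoop configuration key.
        numMappers = conf.getInt("my.num.mappers", 1);
    }

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Emit one key-value pair per mapper for every input record.
        for (int i = 0; i < numMappers; i++) {
            output.collect(new Text(i + "#" + key.toString()), value);
        }
    }
}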

2
wiki.apache.org/hadoop/HowManyMapsAndReduces It depends on your block size and your number of files, so you could actually calculate it outside MapReduce if you wanted to, and then add this number to the DistributedCache of your job. – DDW
Thank you @irW for the comment! I have something like that already, but I was wondering whether there is a standard getter, instead of re-implementing a method that already exists and is already called. I will continue with this solution if there is nothing better, though. – vefthym

2 Answers

3
votes

Finally, it seems that conf.get("mapred.map.tasks") DOES work after all, when I package my program into an executable jar file and run it on the cluster, or locally from the command line. Now the output of "map tasks" is correct.
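
For reference, reading it in the same configure() method as in the question is a one-liner; getInt() is just a convenience over get() that parses the value and takes a default:

public void configure(JobConf conf) {
    // parse the property as an int, falling back to 1 if it is missing
    int numMappers = conf.getInt("mapred.map.tasks", 1);
    System.out.println("map tasks: " + numMappers);
}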

It only failed when running my MapReduce program locally on Hadoop from the Eclipse plugin, so maybe it is an issue with the plugin.

I hope this will help someone else having the same issue. Thank you for your answers!

1
votes

I don't think there is an easy way to do this. I've implemented my own InputFormat class; if you do that, you can count the number of InputSplits in the process that starts the job. If you then put that number in a Configuration setting, you can read it in your mapper.
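
As a rough, untested sketch of what I mean (the property name "my.num.mappers" and the Driver class are just examples, and the usual setMapperClass/setOutputKeyClass calls are omitted):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class Driver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(Driver.class);
        conf.setJobName("count-splits-example");
        // conf.setMapperClass(...), conf.setOutputKeyClass(...), etc.

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Ask the configured InputFormat how many splits the input will
        // actually produce; each split becomes one map task.
        InputSplit[] splits =
                conf.getInputFormat().getSplits(conf, conf.getNumMapTasks());

        // Store the real count in the configuration so that every mapper
        // can read it back in configure(), e.g. conf.getInt("my.num.mappers", 1).
        conf.setInt("my.num.mappers", splits.length);

        JobClient.runJob(conf);
    }
}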

Btw, the number of input files is not always the number of mappers, as large files can be split.