3 votes

I am trying to implement a MapReduce job where each mapper takes 150 lines of the text file and all the mappers run simultaneously; also, the job should not fail, no matter how many map tasks fail.

Here's the configuration part:

        JobConf conf = new JobConf(Main.class);
        conf.setJobName("My mapreduce");

        conf.set("mapreduce.input.lineinputformat.linespermap", "150");
        conf.set("mapred.max.map.failures.percent","100");

        conf.setInputFormat(NLineInputFormat.class);

        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

The problem is that Hadoop creates a mapper for every single line of text, the mappers seem to run sequentially, and if a single one fails, the whole job fails.

From this I deduce that the settings I've applied have no effect.

What did I do wrong?


4 Answers

3 votes

I assume you are using Hadoop 0.20. In 0.20 the configuration parameter is "mapred.line.input.format.linespermap", but you are using "mapreduce.input.lineinputformat.linespermap". If the configuration parameter is not set, it defaults to 1, which is why you are seeing the behavior described in the question.

Here is the code snippet from 0.20 NLineInputFormat.

    public void configure(JobConf conf) {
        N = conf.getInt("mapred.line.input.format.linespermap", 1);
    }

Hadoop configuration is sometimes a real pain: it is not documented properly, and I have observed that configuration parameter names sometimes change between releases. Your best bet is to read the code when you are uncertain about a configuration parameter.
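Putting that together, a minimal corrected version of the question's configuration (a sketch, assuming Hadoop 0.20 and the old `mapred` API throughout, including the `mapred.lib` NLineInputFormat import) would use the old parameter name:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

JobConf conf = new JobConf(Main.class);
conf.setJobName("My mapreduce");

// Old (mapred) API name, the one 0.20's NLineInputFormat actually reads
conf.set("mapred.line.input.format.linespermap", "150");
// Tolerate all map-task failures without failing the job
conf.set("mapred.max.map.failures.percent", "100");

conf.setInputFormat(NLineInputFormat.class);

FileInputFormat.addInputPath(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
```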

1 vote

To start with, "mapred.*" belongs to the old API and "mapreduce.*" to the new API, so you'd better not mix them. Check which version you are using and stick with it. Also recheck your imports, since there are two NLineInputFormat classes as well (one under mapred and one under mapreduce).

Secondly, you can check this link (I'll paste the important part):

NLineInputFormat will split N lines of input as one split. So, each map gets N lines.

But the RecordReader is still LineRecordReader, which reads one line at a time; the Key is therefore the offset in the file and the Value is the line. If you want N lines as the Key, you may need to override LineRecordReader.
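For the new (mapreduce) API, the equivalent setup looks roughly like this sketch (assuming a Hadoop release where `org.apache.hadoop.mapreduce.lib.input.NLineInputFormat` and its static `setNumLinesPerSplit` helper are available):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

Job job = new Job(new Configuration(), "My mapreduce");
job.setJarByClass(Main.class);

// New-API input format; the helper sets
// "mapreduce.input.lineinputformat.linespermap" for you
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 150);

FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
```

Note that even with N lines per split, each call to the mapper's map() still receives one (offset, line) pair, as described above.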

1 vote

If you want to quickly find the correct option names for Hadoop's new API, use this link: http://pydoop.sourceforge.net/docs/examples/intro.html#hadoop-0-21-0-notes .

0 votes

The new API's options are mostly undocumented.