1 vote

I'm running a MapReduce job with the number of reducers set to the default (one reducer). In theory, the output should be one file per reducer, but when I run my job I get two files:

part-r-00000

and

part-r-00001

Why is this happening?

There's only one node in my cluster.

My Driver class:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class DriverDate extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.printf("Usage: AvgWordLength inputDir outputDir\n");
            System.exit(-1);
        }
        Job job = new Job(getConf());
        job.setJobName("Job transformacio dates"); // "date transformation job"

        job.setJarByClass(DriverDate.class);
        job.setMapperClass(MapDate.class);
        job.setReducerClass(ReduceDate.class);

        // Map output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);

        // Final (reduce) output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Note: the number of reduce tasks is never set explicitly here
        job.waitForCompletion(true);

        return 0;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        ToolRunner.run(conf, new DriverDate(), args);
    }

}

Comments:

  • Can you post your main method (or Driver class), as well as the command that you execute to run the program? – vefthym
  • There's no other extra configuration, and I'm sure that the jar I'm running is the correct one. – Arturo Dinaret
  • Then I don't have an answer... just wait for someone else. Sorry and good luck! – vefthym
  • What is the size of your data after the map (the intermediate data)? If you set the number of reducers to 1 manually, do you get any retries in the reduce phase? – Abdulrahman
  • Abdulrahman, I found the answer and you are right: setting the number of reducers to one explicitly is one way to solve the problem. – Arturo Dinaret

2 Answers

1 vote

You are right that this code should produce one output file, since the default number of reduce tasks is 1 and each reducer generates one output file.

However, things that might have gone wrong include (but are not limited to):

  • Make sure that you run the correct jar, and that you regenerate and update it after every code change. Also make sure that you copy the correct jar from the computer that built it to the master of the (single-node) cluster. For example, your usage message says Usage: AvgWordLength inputDir outputDir, but the name of this jar is unlikely to be AvgWordLength...

  • Make sure that you do not specify a different number of reducers on the command line (e.g., through a -D property such as mapreduce.job.reduces); the sketch after this list shows one way to check the value the job actually receives.
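
To illustrate the second point, here is a minimal diagnostic sketch (my addition, not from the original post; the class name PrintReducerCount is made up). ToolRunner feeds -D key=value arguments through GenericOptionsParser, so this prints the reducer count a job submitted with the same arguments would actually use, after command-line overrides and cluster-side defaults (e.g., from the *-site.xml files) have been applied:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.GenericOptionsParser;

public class PrintReducerCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // GenericOptionsParser consumes -D key=value arguments, the same
        // mechanism ToolRunner uses before running the real job.
        new GenericOptionsParser(conf, args);
        // 1 is the stock Hadoop default for mapreduce.job.reduces
        System.out.println("mapreduce.job.reduces = "
                + conf.getInt("mapreduce.job.reduces", 1));
    }
}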

Other than that, I cannot find any other possible cause...

The number of nodes in the cluster is irrelevant.

0 votes

OK, I have found the answer.

In Cloudera Manager, the YARN (MR2) configuration has a default value for the number of reduce tasks per job; on a one-node cluster it is set to 2, so the job runs with two reducers by default.

There are two options to solve this. The first is to set the number of reducers to one explicitly in Java with:

job.setNumReduceTasks(1);

The second is to change the default number of reducers in the YARN configuration in Cloudera Manager.
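
For the first option, a minimal sketch of where the call would go in the driver above (right after the job is created, before submission):

Job job = new Job(getConf());
job.setJobName("Job transformacio dates");
job.setNumReduceTasks(1); // force a single reducer, overriding the cluster-wide default of 2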