I think your problem lies in reading the files in reduce(). You should read them in configure() (old API) or setup() (new API) instead. That way each reducer reads the files just once, rather than once for every input group it receives (basically, on every call to the reduce method).
You can write something like:
Using NEW mapreduce API (org.apache.hadoop.mapreduce.*) -
public static class ReduceJob extends Reducer<Text, Text, Text, Text> {

    ...
    Path file1;
    Path file2;
    ...

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Get the files from the distributed cache
        file1 = DistributedCache.getLocalCacheFiles(context.getConfiguration())[0];
        file2 = DistributedCache.getLocalCacheFiles(context.getConfiguration())[1];

        // Parse the files and keep their data in memory for use in the reduce method,
        // probably in an ArrayList or a HashMap.
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        ...
    }
}
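For example, the parsing step in setup() could look something like the sketch below. It assumes file1 is a plain text file of tab-separated key/value pairs and that you want the data in a HashMap named lookup (both the format and the field name are just placeholders for your own); it also needs java.io.BufferedReader, java.io.FileReader, java.util.HashMap and java.util.Map imports:

private Map<String, String> lookup = new HashMap<String, String>();

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    file1 = DistributedCache.getLocalCacheFiles(context.getConfiguration())[0];

    // Read the local copy of the cached file line by line and load it into the map
    BufferedReader reader = new BufferedReader(new FileReader(file1.toString()));
    try {
        String line;
        while ((line = reader.readLine()) != null) {
            String[] parts = line.split("\t", 2);   // assumed tab-separated format
            if (parts.length == 2) {
                lookup.put(parts[0], parts[1]);     // now available to every reduce() call
            }
        }
    } finally {
        reader.close();
    }
}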
Using OLD mapred API (org.apache.hadoop.mapred.*) -
public static class ReduceJob extends MapReduceBase implements Reducer<Text, Text, Text, Text> {

    ...
    Path file1;
    Path file2;
    ...

    @Override
    public void configure(JobConf job) {
        // Get the files from the distributed cache (configure() cannot throw the
        // checked IOException, so wrap it)
        try {
            file1 = DistributedCache.getLocalCacheFiles(job)[0];
            file2 = DistributedCache.getLocalCacheFiles(job)[1];
        } catch (IOException e) {
            throw new RuntimeException("Could not read the distributed cache", e);
        }
        ...

        // Parse the files and keep their data in memory for use in the reduce method,
        // probably in an ArrayList or a HashMap.
    }

    @Override
    public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output,
                       Reporter reporter) throws IOException {
        ...
    }
}
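And in case you haven't done it yet, the files also have to be added to the distributed cache in the driver before the job is submitted, roughly like this (the HDFS paths below are just placeholders; conf is the job's Configuration with the new API, or the JobConf itself with the old one):

DistributedCache.addCacheFile(new Path("/user/me/cache/file1.txt").toUri(), conf);
DistributedCache.addCacheFile(new Path("/user/me/cache/file2.txt").toUri(), conf);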