0
votes

I feed my Hadoop program an input file of size 4 MB (which has 100k records). As each HDFS block is 64 MB and the file fits in only one block, I set the number of mappers to 1. However, when I increase the number of mappers (let's say to 24), the running time becomes much better. I have no idea why that is the case, as the whole file can be read by only one mapper.

A brief description of the algorithm: the clusters are read from the DistributedCache in the configure function and stored in a global variable called clusters. The mapper reads each chunk line by line and finds the cluster to which each line belongs. Here is some of the code:

public void configure(JobConf job){
    // retrieve the cluster file that was shipped via the DistributedCache
    try {
        Path[] eqFile = DistributedCache.getLocalCacheFiles(job);
        BufferedReader reader = new BufferedReader(new FileReader(eqFile[0].toString()));

        String line;
        while ((line = reader.readLine()) != null) {
            // construct the cluster represented by `line` and add it to a global variable called `clusters`
        }

        reader.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

and the mapper:

public void map(LongWritable key, Text value, OutputCollector<IntWritable, EquivalenceClsAggValue> output, Reporter reporter) throws IOException {
    // assign each record to one of the existing clusters in `clusters`
    String record = value.toString();
    EquivalenceClsAggValue outputValue = new EquivalenceClsAggValue();
    outputValue.addRecord(record);
    int eqID = MondrianTree.findCluster(record, clusters);
    IntWritable outputKey = new IntWritable(eqID);
    output.collect(outputKey, outputValue);
}

I have input files of different sizes (from 4 MB up to 4 GB). How can I find the optimal number of mappers/reducers? Each node in my Hadoop cluster has 2 cores, and I have 58 nodes.

If you have to ask this question, you're probably better off letting Hadoop choose the number of mappers to use. – Mike Park
Can you give some more context to your running job - what's the mapper doing to the 4 MB of data, how many reducers are you running, are you running a combiner etc.? – Chris White
Also, what's the difference in time between the two jobs - are we talking a few seconds or minutes? – Chris White
@ChrisWhite my program is a simple clustering algorithm. Each mapper has a list of clusters and simply assigns each record (within the data set) to one of the clusters. I have a combiner and I have 24 reducers. The time difference is in minutes. – HHH

2 Answers

0
votes

as the whole file can be read by only one mapper.

This isn't really the case. A few points to keep in mind...

  • That single block is replicated 3 times (by default), which means that three separate nodes can access the same block without having to go over the network
  • There's no reason a single block can't be copied to multiple machines, each of which then seeks to the split it has been allocated (see the sketch below)
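
For illustration (this is not from the original answer), one way to spread a small file over many mappers is to split it by line count rather than by HDFS block, e.g. with the old-API NLineInputFormat; the class name and the 5,000-lines-per-map value below are just example choices:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class SmallFileSplitting {
    // Cut the input every N lines instead of one split per HDFS block,
    // so even a 4 MB / 100k-record file is divided across many mappers.
    static void useLineBasedSplits(JobConf conf) {
        conf.setInputFormat(NLineInputFormat.class);
        // ~20 map tasks for a 100k-record file (arbitrary example value)
        conf.setInt("mapred.line.input.format.linespermap", 5000);
    }
}

NLineInputFormat still delivers (LongWritable, Text) pairs to the mapper, so the map signature shown in the question would not need to change.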
0
votes

You need to adjust "mapred.max.split.size". Give the appropriate size in bytes as the value. The MR framework will compute the correct number of mappers based on this and the block size.
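
As an example, here is a minimal driver sketch (the driver class name and the argument paths are placeholders, not taken from the question). Whether mapred.max.split.size is honored depends on the InputFormat in use, so the desired number of map/reduce tasks is also set explicitly via setNumMapTasks, which the old "mapred" API treats as a split-count hint:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ClusteringDriver {                     // hypothetical driver class
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ClusteringDriver.class);
        conf.setJobName("clustering");

        // Aim for ~24 splits of the 4 MB file: cap each split at roughly 170 KB.
        long maxSplitBytes = (4L * 1024 * 1024) / 24;
        conf.setLong("mapred.max.split.size", maxSplitBytes);

        // Hints for the old "mapred" API; the framework may still adjust them.
        conf.setNumMapTasks(24);
        conf.setNumReduceTasks(24);

        conf.setOutputKeyClass(IntWritable.class);
        conf.setOutputValueClass(EquivalenceClsAggValue.class); // writable from the question
        // plus setMapperClass / setCombinerClass / setReducerClass for the
        // classes shown in the question

        FileInputFormat.setInputPaths(conf, new Path(args[0]));  // input path (placeholder arg)
        FileOutputFormat.setOutputPath(conf, new Path(args[1])); // output path (placeholder arg)

        JobClient.runJob(conf);
    }
}

The split size here is just the 4 MB / 24-mapper case from the question; for the 4 GB input you can leave the defaults alone, since the 64 MB block size already yields about 64 splits.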