I feed my Hadoop program an input file of 4 MB (containing 100k records). Since each HDFS block is 64 MB and the file fits in a single block, I set the number of mappers to 1. However, when I increase the number of mappers (say, to 24), the running time improves significantly. I have no idea why that is the case, since the whole file can be read by a single mapper.
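For context, this is roughly how the mapper count is requested (a sketch using the old "mapred" API, assuming a JobConf named conf; setNumMapTasks is only a hint, and the actual number of splits is decided by the InputFormat):

    // Hint to the framework; a single sub-block file still yields one
    // split with the default FileInputFormat.
    conf.setNumMapTasks(24);

    // One way to actually force many mappers on a small file is
    // NLineInputFormat, which assigns each mapper N input lines.
    // The linespermap value here is an assumption based on 100k records.
    conf.setInputFormat(NLineInputFormat.class);
    conf.setInt("mapred.line.input.format.linespermap", 100000 / 24);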
A brief description of the algorithm: the clusters are read from DistributedCache in the configure function and stored in a global variable called clusters. The mapper reads its chunk line by line and finds the cluster to which each line belongs. Here is some of the code:
public void configure(JobConf job) {
    // Retrieve the clusters file from the DistributedCache.
    try {
        Path[] eqFile = DistributedCache.getLocalCacheFiles(job);
        BufferedReader reader = new BufferedReader(new FileReader(eqFile[0].toString()));
        String line;
        while ((line = reader.readLine()) != null) {
            // Construct the cluster represented by `line` and add it
            // to the global variable `clusters`.
        }
        reader.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
and the mapper:
public void map(LongWritable key, Text value, OutputCollector<IntWritable, EquivalenceClsAggValue> output, Reporter reporter) throws IOException {
    // Assign each record to one of the existing clusters in `clusters`.
    String record = value.toString();
    EquivalenceClsAggValue outputValue = new EquivalenceClsAggValue();
    outputValue.addRecord(record);
    int eqID = MondrianTree.findCluster(record, clusters);
    IntWritable outputKey = new IntWritable(eqID);
    output.collect(outputKey, outputValue);
}
I have input files of different sizes (from 4 MB up to 4 GB). How can I find the optimal number of mappers and reducers? Each node in my Hadoop cluster has 2 cores, and I have 58 nodes.
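My back-of-the-envelope reasoning so far (a sketch; the one-task-per-core figure is an assumption, since real slot counts depend on the tasktracker configuration):

    // Rough capacity math for the cluster described above, assuming one
    // concurrent task per core and the default 64 MB HDFS block size.
    public class ClusterSizing {
        static long totalTaskSlots(int nodes, int coresPerNode) {
            return (long) nodes * coresPerNode;
        }

        // Number of input splits the default FileInputFormat would produce:
        // one per block, rounding up the last partial block.
        static long numSplits(long fileBytes, long blockBytes) {
            return (fileBytes + blockBytes - 1) / blockBytes;
        }

        public static void main(String[] args) {
            long blockBytes = 64L * 1024 * 1024;
            System.out.println("task slots: " + totalTaskSlots(58, 2));                              // 116
            System.out.println("splits, 4 MB file: " + numSplits(4L * 1024 * 1024, blockBytes));     // 1
            System.out.println("splits, 4 GB file: " + numSplits(4L * 1024 * 1024 * 1024, blockBytes)); // 64
        }
    }

So with default splitting, the 4 GB file would keep only 64 of the 116 slots busy, while the 4 MB file uses just one.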