I'm trying to the following in hadoop:
- I have implemented a map-reduce job that outputs a file to directory "foo".
- the foo files are with a key=IntWriteable, value=IntWriteable format (used a SequenceFileOutputFormat).
- Now, I want to start another map-reduce job. the mapper is fine, but each reducer is required to read the entire "foo" files at start-up (I'm using the HDFS for sharing data between reducers).
I used this code on the "public void configure(JobConf conf)":
String uri = "out/foo";
FileSystem fs = FileSystem.get(URI.create(uri), conf);
FileStatus[] status = fs.listStatus(new Path(uri));
for (int i=0; i<status.length; ++i) {
Path currFile = status[i].getPath();
System.out.println("status: " + i + " " + currFile.toString());
try {
SequenceFile.Reader reader = null;
reader = new SequenceFile.Reader(fs, currFile, conf);
IntWritable key = (IntWritable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
IntWritable value = (IntWritable ) ReflectionUtils.newInstance(reader.getValueClass(), conf);
while (reader.next(key, value)) {
// do the code for all the pairs.
}
}
}
The code runs well on a single machine, but I'm notsure if it will run on a cluster. In other words, does this code reads files from the current machine or does id read from the distributed system?
Is there a better solution for what I'm trying to do?
Thanks in advance,
Arik.