Reading Distributed Files in Hadoop

Question

I'm trying to the following in hadoop:

I have implemented a map-reduce job that outputs a file to directory "foo".
the foo files are with a key=IntWriteable, value=IntWriteable format (used a SequenceFileOutputFormat).
Now, I want to start another map-reduce job. the mapper is fine, but each reducer is required to read the entire "foo" files at start-up (I'm using the HDFS for sharing data between reducers).

I used this code on the "public void configure(JobConf conf)":

String uri = "out/foo";
FileSystem fs = FileSystem.get(URI.create(uri), conf);
FileStatus[] status = fs.listStatus(new Path(uri));
for (int i=0; i<status.length; ++i) {
    Path currFile = status[i].getPath();
    System.out.println("status: " + i + " " + currFile.toString());
    try {
        SequenceFile.Reader reader = null;
        reader = new SequenceFile.Reader(fs, currFile, conf);
        IntWritable key = (IntWritable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        IntWritable value = (IntWritable ) ReflectionUtils.newInstance(reader.getValueClass(), conf);
        while (reader.next(key, value)) {
        // do the code for all the pairs.
        }
    }
}

The code runs well on a single machine, but I'm notsure if it will run on a cluster. In other words, does this code reads files from the current machine or does id read from the distributed system?

Is there a better solution for what I'm trying to do?

Thanks in advance,

Arik.

That seems like a workable way to do it. out/foo will be on HDFS if the defaultFS is configured to be hdfs. I'd recommend reading the files in the setup() method, so you only do it once. — Matthew Rathbone
Thanks, How to do I set the defaultFS to be HDFS?, In addition is there a diffence between the configure() vs the setup() methods - isn;t it just old vs new api, as they provide the same functionality? — Arik B
you set it in your core-site.xml. fs.default.name to something like hdfs://hadoop-nn:8020. By default if you're running in pseudo-distributed mode it will be set to this anyway (maybe hdfs://localhost:8020). configure and setup are essentially the same, yes. — Matthew Rathbone

Niranjan Sarvi Niranjan Sarvi · Accepted Answer · 2013-04-17T19:31:43

The URI for the FileSystem.get() does not have scheme defined and hence, the File System used depends on the configuration parameter fs.defaultFS. If none set, the default setting i.e LocalFile system will be used.

Your program writes to the Local file system under the workingDir/out/foo. It should work in the cluster as well but looks for the local file system.

With the above said, I'm not sure why you need the entire files from foo directory. You may have consider other designs. If needed, these files should copied to HDFS first and read the files from the overridden setup method of your reducer. Needless to say, to close the files opened in the overridden closeup method of your reducer. While the files can be read in reducers, the map/reduce programs are not designed for this kind of functionality.

Reading Distributed Files in Hadoop

1 Answers