Programmatically reading the output of Hadoop Mapreduce Program

Question

This may be a basic question, but I could not find an answer for it on Google.
I have a map-reduce job that creates multiple output files in its output directory. My Java application executes this job on a remote hadoop cluster and after the job is finished, it needs to read the output programatically using org.apache.hadoop.fs.FileSystem API. Is it possible?
The application knows the output directory, but not the names of the output files generated by the map-reduce job. It seems there is no way to programatically list the contents of a directory in the hadoop file system API. How will the output files be read?
It seems such a commonplace scenario, that I am sure it has a solution. But I am missing something very obvious.

Thomas Jungblut Thomas Jungblut · Accepted Answer · 2011-04-12T12:28:49

The method you are looking for is called listStatus(Path). It simply returns all files inside of a Path as a FileStatus array. Then you can simply loop over them create a path object and read it.

    FileStatus[] fss = fs.listStatus(new Path("/"));
    for (FileStatus status : fss) {
        Path path = status.getPath();
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        IntWritable key = new IntWritable();
        IntWritable value = new IntWritable();
        while (reader.next(key, value)) {
            System.out.println(key.get() + " | " + value.get());
        }
        reader.close();
    }

For Hadoop 2.x you can setup the reader like this:

 SequenceFile.Reader reader = 
           new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))

Programmatically reading the output of Hadoop Mapreduce Program

3 Answers