one mapper or a reducer to process one file or directory

Question

I am new to Hadoop and MapReduce. I have several directories, each containing N files (each file is about 10 MB, and N could be 100; files may be compressed or uncompressed), like: MyDir1/file1 MyDir1/file2 ... MyDir1/fileN

MyDir2/file1 MyDir2/file2 ... MyDir2/fileN

I want to design a MapReduce application where one mapper or reducer processes the entire MyDir1, i.e. I don't want MyDir1 to be split across multiple mappers. Similarly, I want MyDir2 to be processed completely by another mapper/reducer, without splitting.

Any idea on how to go about this? Do I need to write my own InputFormat and read the input files?

I actually have the same 2 requirements. I need the file not to be split because there is header information at the top of the file. I need a directory per mapper so that I can process the files in that directory in order as sorting the files (by date/time) is much more efficient than sorting individual rows. — MikeKulls

Praveen Sripati · Accepted Answer · 2012-01-12T07:58:08

Override FileInputFormat#isSplitable() to return false. The input files are then not split, and each one is processed by a single map task. Note that the time to complete the job depends on the time to process the largest input file, even though the mappers execute in parallel. Also, this might not be efficient, as there will be a lot of data shuffling across nodes: any blocks of a file that are not local to the mapper's node have to be read over the network.

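import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Uses the old (org.apache.hadoop.mapred) API.
public class NonSplittableTextInputFormat extends TextInputFormat {
    // Returning false tells the framework never to split a file,
    // so each input file is handled by exactly one mapper.
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}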

The current API doesn't allow a whole directory to be processed by a single mapper. You might have to write your own InputFormat. Alternatively, create a list of the directories to be processed and pass a single directory to each mapper; again, this is not efficient because of data shuffling between nodes.
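
The "list of directories" idea can be sketched in plain Java, without any Hadoop dependency. The class and method names here (DirectoryWorkList, groupByDirectory) are hypothetical, and the paths are the ones from the question. The sketch only builds the per-directory work list; how each entry reaches its own mapper (for example, one line of a text file per map task) is left as an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Hadoop: groups file paths by their parent
// directory so that each directory can be handed to one mapper as a unit.
class DirectoryWorkList {

    // Returns directory -> list of files in that directory.
    static Map<String, List<String>> groupByDirectory(List<String> paths) {
        Map<String, List<String>> byDir = new TreeMap<>();
        for (String path : paths) {
            int slash = path.lastIndexOf('/');
            // Assumption: a path without a slash is treated as living in ".".
            String dir = slash < 0 ? "." : path.substring(0, slash);
            byDir.computeIfAbsent(dir, d -> new ArrayList<>()).add(path);
        }
        // Lexicographic sort within each directory; a real job might sort
        // by date/time instead.
        for (List<String> files : byDir.values()) {
            Collections.sort(files);
        }
        return byDir;
    }

    public static void main(String[] args) {
        List<String> paths = List.of(
                "MyDir1/file2", "MyDir2/file1", "MyDir1/file1", "MyDir2/file2");
        // One line per directory: each line is the work for a single mapper.
        groupByDirectory(paths).forEach(
                (dir, files) -> System.out.println(dir + "\t" + String.join(",", files)));
    }
}
```

Each output line names one directory and its files, so a mapper that receives a single line knows everything it has to open and in which order.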

Coming back to reducers, they operate on the output KV pairs from the mappers and not the input files/directories.
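
To illustrate the point about reducers: a reducer sees whatever keys the mappers emit, so one way to get all of a directory's records into a single reduce call would be to have each mapper emit the directory name as the key. The sketch below simulates that grouping in plain Java. It is not Hadoop code, and the names (ShuffleSketch, mapRecord, shuffle) are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation (not Hadoop code) of how map output keys decide
// what a reducer sees. All names here are hypothetical.
class ShuffleSketch {

    // "Map" step: tag a record with the parent directory of its source file.
    static Map.Entry<String, String> mapRecord(String filePath, String record) {
        int slash = filePath.lastIndexOf('/');
        String dir = slash < 0 ? "." : filePath.substring(0, slash);
        return new SimpleEntry<>(dir, record);
    }

    // "Shuffle" step: collect all values that share a key, as the framework
    // does before calling reduce once per key.
    static Map<String, List<String>> shuffle(List<Map.Entry<String, String>> mapOutput) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> kv : mapOutput) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> mapOutput = new ArrayList<>();
        mapOutput.add(mapRecord("MyDir1/file1", "a"));
        mapOutput.add(mapRecord("MyDir2/file1", "b"));
        mapOutput.add(mapRecord("MyDir1/file2", "c"));
        // Each entry corresponds to one reduce call: key = directory.
        shuffle(mapOutput).forEach((dir, records) -> System.out.println(dir + " -> " + records));
    }
}
```

In a real job the order of values within one reduce call is not guaranteed unless a secondary sort is configured, so this grouping alone does not give the in-order processing mentioned in the comment on the question.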
