
I have a MapReduce job that exports the plain text of an HBase table. I'm emulating the Export class that ships with HBase and not running any reducers. In addition, I'm just writing an empty String for the key. Something like this:

public void map(ImmutableBytesWritable key, Result value, Context context) throws IOException, InterruptedException {
    // Write every cell value as plain text, with an empty string as the key
    List<Cell> cells = value.listCells();
    for (Cell cell : cells) {
        context.write(new Text(""), new Text(CellUtil.cloneValue(cell)));
    }
}

This works fine, except that the number of output map files (e.g. part-m-NNNNN) is at the mercy of however many splits the HBase table has.

Is there a way to combine the output map files in the mapreduce job?

I've considered using a random integer between 1 and 50 as the key and then a reducer that strips the key before writing out to HDFS, but this seems like a hack.

1 Answer


Irrespective of your input, I understand that you want to merge all of the map outputs into a single file. Below are the options:

  1. The hadoop fs -getmerge shell command - this concatenates every part file in the job's HDFS output directory into a single file on the local filesystem (a programmatic sketch follows this list).
  2. Make the input non-splittable so that only one mapper runs and produces a single output file - since you are reading from HBase, having one mapper do all the work might not be a good option.
  3. Write a reducer and configure the job to run only one reducer, which is essentially what you are proposing.
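
For option 1, the shell form is hadoop fs -getmerge <hdfs-output-dir> <local-file>. If you would rather do the same thing from Java after the job finishes, something like the sketch below could work; it assumes Hadoop 2.x (FileUtil.copyMerge was removed in Hadoop 3), and the path arguments are placeholders.

// Sketch only: assumes Hadoop 2.x, where FileUtil.copyMerge is still available
// (it was removed in Hadoop 3). args[0] and args[1] are placeholder paths.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeExportOutput {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        FileSystem local = FileSystem.getLocal(conf);

        Path jobOutputDir = new Path(args[0]); // HDFS dir holding the part-m-NNNNN files
        Path mergedFile = new Path(args[1]);   // single file on the local filesystem

        // Concatenates every part file under jobOutputDir into mergedFile,
        // which is what hadoop fs -getmerge does from the shell.
        FileUtil.copyMerge(hdfs, jobOutputDir, local, mergedFile,
                false /* keep the source files */, conf, null);
    }
}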

Given that you are reading from HBase, 1 and 3 are good options. I'm not sure why you consider it a hack; you can use the row key as the mapper output key rather than a random integer.
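
For option 3, here is a rough sketch of what that could look like. The class names (PlainTextExport, ExportMapper, StripKeyReducer), the scan settings and the table/output arguments are placeholders for whatever your Export-style job already uses, and it assumes row keys are readable text. The mapper emits the row key instead of an empty string, the job is pinned to a single reducer, and the reducer drops the key by writing a NullWritable, so you end up with one part-r-00000 file containing only the cell text.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class PlainTextExport {

    // Mapper: emit the row key (instead of an empty or random key) with each cell value.
    public static class ExportMapper extends TableMapper<Text, Text> {
        @Override
        public void map(ImmutableBytesWritable key, Result value, Context context)
                throws IOException, InterruptedException {
            Text rowKey = new Text(key.copyBytes()); // assumes readable row keys
            for (Cell cell : value.listCells()) {
                context.write(rowKey, new Text(CellUtil.cloneValue(cell)));
            }
        }
    }

    // Reducer: strip the key so only the cell text reaches HDFS.
    public static class StripKeyReducer extends Reducer<Text, Text, NullWritable, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text v : values) {
                context.write(NullWritable.get(), v); // TextOutputFormat omits NullWritable keys
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "plain-text-export");
        job.setJarByClass(PlainTextExport.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // arbitrary example value
        scan.setCacheBlocks(false);  // recommended for MapReduce scans

        TableMapReduceUtil.initTableMapperJob(
                args[0],           // source table name
                scan,
                ExportMapper.class,
                Text.class,        // mapper output key
                Text.class,        // mapper output value
                job);

        job.setReducerClass(StripKeyReducer.class);
        job.setNumReduceTasks(1);  // single reducer -> single output file
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Keep in mind that a single reducer funnels the whole export through one task, so this is fine for modest tables; for large ones, a handful of reducers followed by getmerge is usually more practical.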