
I'm trying to write a MapReduce job that parses a CSV file, stores the data in HBase, and runs a reduce function, all in one go. Ideally I would like:

  1. The Mapper to output good records to HBase table GOOD.
  2. The Mapper to output bad records to HBase table BAD.
  3. The Mapper to send all the good data to a reducer, keyed appropriately.
  4. To update a third table indicating the presence of new data. This table will have basic info about the data and the date; most likely one or two records per CSV file.

I know how to do 1 and 2 using HBase's MultiTableOutputFormat, but I'm unsure how to do 3 and 4.

Any pointers on how to do this are much appreciated.

I have a few thoughts on how to do this:

For 1 and 2 I would have ImmutableBytesWritable as the key, and MultiTableOutputFormat takes care of storing the records from the Mapper. But for 3 I would like the key to be Text.
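For reference, a minimal sketch of the map-only MultiTableOutputFormat pattern I mean for 1 and 2 (class name, column family, and the validation check are placeholders, and it assumes the HBase 1.x client API). The conflict is exactly that this pattern fixes the Mapper output key to ImmutableBytesWritable, which names the destination table:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map-only pattern: the driver sets
    // job.setOutputFormatClass(MultiTableOutputFormat.class) and
    // job.setNumReduceTasks(0); the key selects the destination table.
    public class RoutingMapper extends Mapper<Text, Text, ImmutableBytesWritable, Put> {
      private static final ImmutableBytesWritable GOOD =
          new ImmutableBytesWritable(Bytes.toBytes("GOOD"));
      private static final ImmutableBytesWritable BAD =
          new ImmutableBytesWritable(Bytes.toBytes("BAD"));

      @Override
      protected void map(Text key, Text value, Context context)
          throws IOException, InterruptedException {
        Put put = new Put(Bytes.toBytes(key.toString()));
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("raw"),
            Bytes.toBytes(value.toString()));
        boolean good = value.toString().split(",").length >= 2;  // placeholder validation
        context.write(good ? GOOD : BAD, put);  // routed by table-name key
      }
    }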

For #4, should I do this in the Mapper by

  1. Scanning the third HBase table for the entry, and populating it only if it isn't there? I don't like this, since it feels very inefficient.
  2. Or should I maintain a List in the Mapper and write to HBase in the Mapper's cleanup() method (see the sketch after this list)?
  3. Is there a better way to do this?
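Here is a sketch of option 2, assuming the per-file info is small enough to hold in memory (one or two records per CSV, as above) and the HBase 1.x client API; FileInfoMapper, the FILE_INFO table, and the info:loadDate column are all illustrative names:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class FileInfoMapper extends Mapper<Text, Text, Text, Text> {
      // Distinct file-info rows seen by this mapper; stays tiny (one or two per file).
      private final Map<String, Put> fileInfo = new HashMap<>();

      @Override
      protected void map(Text key, Text value, Context context)
          throws IOException, InterruptedException {
        // Derive the source CSV name from the input split; duplicates just overwrite.
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        Put p = new Put(Bytes.toBytes(fileName));
        p.addColumn(Bytes.toBytes("info"), Bytes.toBytes("loadDate"),
            Bytes.toBytes(System.currentTimeMillis()));
        fileInfo.put(fileName, p);
        context.write(key, value);  // normal record flow continues as before
      }

      @Override
      protected void cleanup(Context context) throws IOException {
        // One batched write per mapper instead of a scan/check per record.
        try (Connection conn = ConnectionFactory.createConnection(context.getConfiguration());
             Table table = conn.getTable(TableName.valueOf("FILE_INFO"))) {  // assumed name
          table.put(new ArrayList<>(fileInfo.values()));
        }
      }
    }

Since HBase puts are idempotent, several mappers writing the same file-info row simply overwrite one another, so no scan-and-check should be needed.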

1 Answer

  • The Mapper reads the CSV by setting KeyValueTextInputFormat as the input format (see the driver sketch after this list).

  • In the Mapper code, have some logic to distinguish between good and bad records, and put them into HBase using Put (HBase API calls).
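A minimal driver sketch wiring this together, assuming Hadoop 2.x (the separator property name differs on older versions); the job name, class names, and use of a comma separator are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // KeyValueTextInputFormat splits each line at the first separator:
    // the text before it becomes the key, the rest becomes the value.
    public class CsvLoadDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Default separator is a tab; for CSV, use the comma instead.
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");

        Job job = Job.getInstance(conf, "csv-to-hbase");
        job.setJarByClass(CsvLoadDriver.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setMapperClass(CsvMapper.class);           // sketched below
        job.setReducerClass(GoodRecordReducer.class);  // sketched further down
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }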

In the Mapper's setup() method, a handle for each HBase table can be initialized.
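A minimal sketch of that, assuming the HBase 1.x client API; table names, the column family, and the validation check are placeholders. A BufferedMutator is used so Puts are batched rather than sent one round trip at a time:

    import java.io.IOException;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.BufferedMutator;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CsvMapper extends Mapper<Text, Text, Text, Text> {
      private Connection connection;
      private BufferedMutator goodTable;  // buffered writes for throughput
      private BufferedMutator badTable;

      @Override
      protected void setup(Context context) throws IOException {
        connection = ConnectionFactory.createConnection(context.getConfiguration());
        goodTable = connection.getBufferedMutator(TableName.valueOf("GOOD"));  // assumed name
        badTable  = connection.getBufferedMutator(TableName.valueOf("BAD"));   // assumed name
      }

      @Override
      protected void map(Text key, Text value, Context context)
          throws IOException, InterruptedException {
        Put put = new Put(Bytes.toBytes(key.toString()));
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("raw"),
            Bytes.toBytes(value.toString()));
        if (isGood(value.toString())) {        // validation logic is application-specific
          goodTable.mutate(put);
          context.write(key, value);           // good records also go to the reducer
        } else {
          badTable.mutate(put);
        }
      }

      private boolean isGood(String record) {
        return record.split(",").length >= 2;  // placeholder check
      }

      @Override
      protected void cleanup(Context context) throws IOException {
        goodTable.close();  // flushes any buffered Puts
        badTable.close();
        connection.close();
      }
    }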

The good records can be passed to the reducer using context.write(key, value) and collected in the reducer.
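A minimal reducer sketch to go with it; the concatenation here is just a stand-in for whatever aggregation the job actually needs:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Collects all good records that share a key.
    public class GoodRecordReducer extends Reducer<Text, Text, Text, Text> {
      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        StringBuilder combined = new StringBuilder();
        for (Text v : values) {
          combined.append(v.toString()).append(';');
        }
        context.write(key, new Text(combined.toString()));
      }
    }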