
I'm trying to write a MapReduce job that parses a CSV file, stores the data in HBase, and runs a reduce function, all in one go. Ideally I would like:

  1. The Mapper to output good records to HBase table GOOD.
  2. The Mapper to output bad records to HBase table BAD.
  3. The Mapper to send all the good data to a reducer, keyed appropriately.
  4. To update a third table indicating the presence of new data. This table will have basic info about the data and the date; most likely one or two records per CSV file.

I know how to do 1 and 2 using HBase's MultiTableOutputFormat, but I'm unsure how to do 3 and 4.

Any pointers on how to do this are much appreciated.

I have a few thoughts on how to do this:

For 1 and 2 I would have ImmutableBytesWritable as the key, and MultiTableOutputFormat takes care of storing the records from the Mapper. But for 3 I would like the key to be Text.
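For reference, a minimal sketch of the map-only MultiTableOutputFormat pattern I mean for 1 and 2 (class name, column family, and the validation check are placeholders, and it assumes the HBase 1.x client API). The conflict is exactly that this pattern fixes the Mapper output key to ImmutableBytesWritable, which names the destination table:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map-only pattern: the driver sets
    // job.setOutputFormatClass(MultiTableOutputFormat.class) and
    // job.setNumReduceTasks(0); the key selects the destination table.
    public class RoutingMapper extends Mapper<Text, Text, ImmutableBytesWritable, Put> {
      private static final ImmutableBytesWritable GOOD =
          new ImmutableBytesWritable(Bytes.toBytes("GOOD"));
      private static final ImmutableBytesWritable BAD =
          new ImmutableBytesWritable(Bytes.toBytes("BAD"));

      @Override
      protected void map(Text key, Text value, Context context)
          throws IOException, InterruptedException {
        Put put = new Put(Bytes.toBytes(key.toString()));
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("raw"),
            Bytes.toBytes(value.toString()));
        boolean good = value.toString().split(",").length >= 2;  // placeholder validation
        context.write(good ? GOOD : BAD, put);  // routed by table-name key
      }
    }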

For #4, should I do this in the Mapper by

  1. Scanning the third HBase table for the entry, and populating it only if it isn't there? I don't like this, since it feels very inefficient.
  2. Or should I maintain a List in the Mapper and write to HBase in the Mapper's cleanup() method (see the sketch after this list)?
  3. Is there a better way to do this?
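Here is a sketch of option 2, assuming the per-file info is small enough to hold in memory (one or two records per CSV, as above) and the HBase 1.x client API; FileInfoMapper, the FILE_INFO table, and the info:loadDate column are all illustrative names:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class FileInfoMapper extends Mapper<Text, Text, Text, Text> {
      // Distinct file-info rows seen by this mapper; stays tiny (one or two per file).
      private final Map<String, Put> fileInfo = new HashMap<>();

      @Override
      protected void map(Text key, Text value, Context context)
          throws IOException, InterruptedException {
        // Derive the source CSV name from the input split; duplicates just overwrite.
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        Put p = new Put(Bytes.toBytes(fileName));
        p.addColumn(Bytes.toBytes("info"), Bytes.toBytes("loadDate"),
            Bytes.toBytes(System.currentTimeMillis()));
        fileInfo.put(fileName, p);
        context.write(key, value);  // normal record flow continues as before
      }

      @Override
      protected void cleanup(Context context) throws IOException {
        // One batched write per mapper instead of a scan/check per record.
        try (Connection conn = ConnectionFactory.createConnection(context.getConfiguration());
             Table table = conn.getTable(TableName.valueOf("FILE_INFO"))) {  // assumed name
          table.put(new ArrayList<>(fileInfo.values()));
        }
      }
    }

Since HBase puts are idempotent, several mappers writing the same file-info row simply overwrite one another, so no scan-and-check should be needed.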

1 Answer

  • The Mapper reads the CSV by setting KeyValueTextInputFormat as the input format (see the driver sketch after this list).

  • In the Mapper code, have some logic to distinguish between good and bad records, and put them into HBase using Put (HBase API calls).
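A minimal driver sketch wiring this together, assuming Hadoop 2.x (the separator property name differs on older versions); the job name, class names, and use of a comma separator are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // KeyValueTextInputFormat splits each line at the first separator:
    // the text before it becomes the key, the rest becomes the value.
    public class CsvLoadDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Default separator is a tab; for CSV, use the comma instead.
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");

        Job job = Job.getInstance(conf, "csv-to-hbase");
        job.setJarByClass(CsvLoadDriver.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setMapperClass(CsvMapper.class);           // sketched below
        job.setReducerClass(GoodRecordReducer.class);  // sketched further down
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }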

In the Mapper's setup() method, a handle for each HBase table can be initialized.
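A minimal sketch of that, assuming the HBase 1.x client API; table names, the column family, and the validation check are placeholders. A BufferedMutator is used so Puts are batched rather than sent one round trip at a time:

    import java.io.IOException;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.BufferedMutator;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CsvMapper extends Mapper<Text, Text, Text, Text> {
      private Connection connection;
      private BufferedMutator goodTable;  // buffered writes for throughput
      private BufferedMutator badTable;

      @Override
      protected void setup(Context context) throws IOException {
        connection = ConnectionFactory.createConnection(context.getConfiguration());
        goodTable = connection.getBufferedMutator(TableName.valueOf("GOOD"));  // assumed name
        badTable  = connection.getBufferedMutator(TableName.valueOf("BAD"));   // assumed name
      }

      @Override
      protected void map(Text key, Text value, Context context)
          throws IOException, InterruptedException {
        Put put = new Put(Bytes.toBytes(key.toString()));
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("raw"),
            Bytes.toBytes(value.toString()));
        if (isGood(value.toString())) {        // validation logic is application-specific
          goodTable.mutate(put);
          context.write(key, value);           // good records also go to the reducer
        } else {
          badTable.mutate(put);
        }
      }

      private boolean isGood(String record) {
        return record.split(",").length >= 2;  // placeholder check
      }

      @Override
      protected void cleanup(Context context) throws IOException {
        goodTable.close();  // flushes any buffered Puts
        badTable.close();
        connection.close();
      }
    }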

The good records can be passed to the reducer using context.write(key, value) and collected in the reducer.
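A minimal reducer sketch to go with it; the concatenation here is just a stand-in for whatever aggregation the job actually needs:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Collects all good records that share a key.
    public class GoodRecordReducer extends Reducer<Text, Text, Text, Text> {
      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        StringBuilder combined = new StringBuilder();
        for (Text v : values) {
          combined.append(v.toString()).append(';');
        }
        context.write(key, new Text(combined.toString()));
      }
    }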