I'm trying to write a MapReduce job that parses a CSV file, stores the data in HBase, and runs a reduce function, all in one go. Ideally I would like:
- The Mapper to write good records to HBase table GOOD
- The Mapper to write bad records to HBase table BAD
- The Mapper to send all the good data to a reducer using a key
- To also update a third table indicating the presence of new data. This table will hold basic info about the data and its date, most probably one or two records per CSV file.
I know how to do 1 and 2 using HBase MultiTableOutputFormat, but I'm unsure how to do 3 and 4.
Any pointers on how to do this are much appreciated.
I've a few thoughts on how to do this:
For 1 and 2 I would use ImmutableBytesWritable as the key, and MultiTableOutputFormat takes care of storing the records from the Mapper. But for 3 I would like the key to be Text.
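To make the routing in 1 to 3 concrete, here is a minimal plain-Java sketch of what I want the map step to do. It deliberately uses no Hadoop or HBase classes: the three lists only stand in for table GOOD, table BAD and the reducer input. The class name CsvRecordRouter, the three-field validity rule and the use of the first column as the reducer key are all assumptions made up for illustration, not part of my real job.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-in for the map step; no Hadoop or HBase classes are used.
class CsvRecordRouter {
    // Hypothetical rule: a valid row has exactly this many comma-separated fields.
    static final int EXPECTED_FIELDS = 3;

    final List<String> goodTable = new ArrayList<>();   // stands in for HBase table GOOD
    final List<String> badTable = new ArrayList<>();    // stands in for HBase table BAD
    final List<String[]> toReducer = new ArrayList<>(); // (key, value) pairs meant for the reducer

    // A row is "good" when it has the expected field count and no blank field.
    static boolean isGood(String[] fields) {
        if (fields.length != EXPECTED_FIELDS) {
            return false;
        }
        for (String field : fields) {
            if (field.trim().isEmpty()) {
                return false;
            }
        }
        return true;
    }

    // Routes one CSV line: good rows go to GOOD and to the reducer, the rest go to BAD.
    void map(String line) {
        String[] fields = line.split(",", -1); // -1 keeps trailing empty fields
        if (isGood(fields)) {
            goodTable.add(line);
            // Assumption: the first column is the key the reducer groups on.
            toReducer.add(new String[] {fields[0].trim(), line});
        } else {
            badTable.add(line);
        }
    }

    public static void main(String[] args) {
        CsvRecordRouter router = new CsvRecordRouter();
        router.map("sensorA,2015-03-01,42");
        router.map("not a valid row");
        System.out.println("good=" + router.goodTable.size() + " bad=" + router.badTable.size());
    }
}
```

The part I can't work out is how to get both kinds of output from one real Mapper: the table writes keyed by ImmutableBytesWritable and the reducer input keyed by Text.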
For #4, should I do this in the Mapper by:
- Scanning the third HBase table for the entry and populating it only if it is not there? I don't like this since it feels very inefficient.
- Or should I maintain a List in the Mapper and write it to HBase in the Mapper's cleanup method?
- Is there a better way to do this?
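The second option for #4 would look roughly like the plain-Java sketch below. It again leaves out Hadoop and HBase: summaryTable only stands in for the third table, and cleanup() stands in for the Mapper's cleanup method. The class name SummaryBuffer and the "name|date" entry format are hypothetical. I used a Set rather than a List so that repeated entries from the same CSV file collapse to a single write.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Plain-Java stand-in for buffering summary entries and flushing them once at the end.
class SummaryBuffer {
    // Distinct entries seen so far; the Set drops duplicates while keeping insertion order.
    private final Set<String> pending = new LinkedHashSet<>();

    // Stands in for the third HBase table.
    final List<String> summaryTable = new ArrayList<>();

    // Called once per good record during map; only remembers the entry, writes nothing.
    void record(String dataName, String date) {
        pending.add(dataName + "|" + date); // hypothetical entry format
    }

    // Called once at the end, the way a Mapper's cleanup method would be.
    void cleanup() {
        summaryTable.addAll(pending);
        pending.clear();
    }

    public static void main(String[] args) {
        SummaryBuffer buffer = new SummaryBuffer();
        buffer.record("sensorA", "2015-03-01");
        buffer.record("sensorA", "2015-03-01"); // duplicate, collapsed by the Set
        buffer.record("sensorB", "2015-03-01");
        buffer.cleanup();
        System.out.println(buffer.summaryTable);
    }
}
```

This avoids a scan per record, but each Mapper would still write its own copy of the entries, which is part of why I'm asking whether there is a better way.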