3
votes

I am just evaluating HBase for some of the data analysis we are doing.

HBase would contain our event data. The key would be eventId + time. We want to run analysis on a few event types (4-5) over a date range. The total number of event types is around 1000.

The problem with running a MapReduce job over the HBase table is that initTableMapperJob (see below) takes only one Scan object. For performance reasons we want to scan the data for only the 4-5 event types in a given date range, not all 1000 event types. If we use the method below, I guess we don't have that choice because it takes only one Scan object.

  public static void initTableMapperJob(String table, Scan scan,
      Class mapper, Class outputKeyClass, Class outputValueClass,
      org.apache.hadoop.mapreduce.Job job) throws IOException

Is it possible to run MapReduce on a list of Scan objects? Any workaround?

Thanks


3 Answers

9
votes

TableMapReduceUtil.initTableMapperJob configures your job to use TableInputFormat which, as you note, takes a single Scan.

It sounds like you want to scan multiple segments of a table. To do so, you'll have to create your own InputFormat, something like a MultiSegmentTableInputFormat. Extend TableInputFormatBase and override the getSplits method so that it calls super.getSplits once for each start/stop row segment of the table. (The easiest way is to call getScan().setStartRow() and setStopRow() with the segment's boundaries each time.) Aggregate the returned InputSplit instances into a single list, as in the sketch below.
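
Here's a rough sketch of what that could look like (my reconstruction, not tested code). It extends TableInputFormat rather than TableInputFormatBase so the table and base Scan setup from the job configuration is reused; the SEGMENTS property name and the "startRow|stopRow" encoding are hypothetical:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;

    public class MultiSegmentTableInputFormat extends TableInputFormat {

        // Hypothetical property holding comma-separated "startRow|stopRow" pairs.
        public static final String SEGMENTS = "multisegment.scan.segments";

        @Override
        public List<InputSplit> getSplits(JobContext context) throws IOException {
            List<InputSplit> splits = new ArrayList<InputSplit>();
            for (String segment : context.getConfiguration().getStrings(SEGMENTS)) {
                String[] range = segment.split("\\|");
                // Point the shared Scan at this segment, then let the parent class
                // compute region-aligned splits for just that key range.
                getScan().setStartRow(Bytes.toBytes(range[0]));
                getScan().setStopRow(Bytes.toBytes(range[1]));
                splits.addAll(super.getSplits(context));
            }
            return splits;
        }
    }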

Then configure the job yourself to use your custom MultiSegmentTableInputFormat.

0
votes

You are looking for the class:

org/apache/hadoop/hbase/filter/FilterList.java

Each scan can take a filter. A filter can be quite complex. The FilterList allows you to specify multiple single filters and then do an AND or an OR between all of the component filters. You can use this to build up an arbitrary boolean query over the rows.
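
For example, here is a sketch of a single Scan that only returns rows for a handful of event types. The event ids and the prefix-based key layout are assumptions on my part, based on the eventId + time key described in the question:

    import java.util.Arrays;
    import java.util.List;

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.FilterList;
    import org.apache.hadoop.hbase.filter.PrefixFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class EventTypeScanBuilder {

        // OR together one PrefixFilter per wanted event type, so a single Scan
        // only returns rows whose key starts with one of those event ids.
        public static Scan buildScan(List<String> eventIds) {
            FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ONE); // OR
            for (String eventId : eventIds) {
                filters.addFilter(new PrefixFilter(Bytes.toBytes(eventId)));
            }
            Scan scan = new Scan();
            scan.setFilter(filters);
            return scan;
        }

        public static void main(String[] args) {
            // Hypothetical event ids.
            Scan scan = buildScan(Arrays.asList("event0042", "event0087"));
            System.out.println(scan);
        }
    }

Keep in mind that filters are evaluated server-side but the region servers still read through the whole key range of the Scan, so this doesn't narrow the scan the way separate start/stop rows do.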

0
votes

I've tried Dave L's approach and it works beautifully.

To configure the map job, you can use the function

  TableMapReduceUtil.initTableMapperJob(byte[] table, Scan scan,
      Class<? extends TableMapper> mapper,
      Class<? extends WritableComparable> outputKeyClass,
      Class<? extends Writable> outputValueClass, Job job,
      boolean addDependencyJars, Class<? extends InputFormat> inputFormatClass)

where inputFormatClass refers to the MultiSegmentTableInputFormat described in Dave L's answer.
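
For completeness, here is a rough driver sketch wiring it together. The table name, mapper, and output classes are placeholders of mine, not anything from the original answers:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    public class MultiSegmentJobDriver {

        // Hypothetical mapper: emits one count per event row.
        static class EventMapper extends TableMapper<Text, LongWritable> {
            @Override
            protected void map(ImmutableBytesWritable rowKey, Result result, Context context)
                    throws java.io.IOException, InterruptedException {
                context.write(new Text(Bytes.toString(rowKey.get())), new LongWritable(1L));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = Job.getInstance(conf, "multi-segment-scan");
            job.setJarByClass(MultiSegmentJobDriver.class);

            Scan scan = new Scan();
            scan.setCaching(500);       // fewer RPCs per mapper
            scan.setCacheBlocks(false); // recommended for MR scans over HBase

            // The overload quoted above: the last argument swaps in the custom InputFormat.
            TableMapReduceUtil.initTableMapperJob(
                Bytes.toBytes("events"),              // table name (assumed)
                scan,
                EventMapper.class,
                Text.class,
                LongWritable.class,
                job,
                true,                                 // addDependencyJars
                MultiSegmentTableInputFormat.class);  // from the accepted answer's sketch

            // Reducer and output configuration omitted for brevity.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }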