Performing HBase queries optimally in MapReduce

Question

Problem

We have multiple HBase tables :A, B, C. Lets assume, A is queue of records that needs to be processed. It could contain an average of 25 million records. A has user ids. B has website hits performed by each user. B could contain billions of rows. C has some secondary information about users.

We use MapReduce job to perform predictive analysis (thousands and thousands of decision trees) on the records in queue. Scope of the questions does not include actual analytic modeling.

Question

MR job is performing adhoc queries to tables B & C. For example, Map task 1 performing query to get hits for user 1 and Map task 2 performing query to get hits for user 2. If these hits ends up in same region server, would it hamper performance (race conditions, etc)? Is there a pattern like ChainMapper (ChainReducer) to split input set so that each mapper has keys that spans one region server?
My initial thought was to have queue to contain all required inputs (results from b and c). This input would be condensed (only which is required for modeling). This approach would avoid performing adhoc queries (across region servers by multiple map tasks in the same time).

Any other suggestions welcome.

We are using cloudera CDH 3 (hadoop, hbase).

octo octo · Accepted Answer · 2012-10-14T19:57:06

It is not very easy to solve, but I can suggest to use bloomfilter + reduce join.

Build bloomfilter and set of affected regions of B

Map: A -> BF(A), S = {regions of B}

Use custom InpufFormat, which will use affected regions for B-table scan, and scan whole table A

Map: B U S -> (tag 'B', keyB => value)
     A -> (tag 'A', keyA => value)
Reduce: reduce-join

Do your analytic work in Reduce.

Performing HBase queries optimally in MapReduce

1 Answers