2
votes

I am very new to Hadoop and HBase concepts. So please forgive me if the answer to my question is too obvious.

I need to get sales report from two HBase tables. I am trying to represent here the most simplified form of the problem I am dealing with. There are two tables, Products and Sales.

Products Table

ProductCode   ProductName
-----------   -----------
APL                Apple
BAN                Banana
MNG                Mango
ORG                Orange

Sales Table

ProductCode    Quantity
---------    ----------
MNG                 100
BAN                   8
MNG                   3
APL                  24
APL                  57
BAN                  33
ORG                  40
ORG                  15

The kind of reduced output I need :

Report

Product Name    Total Sales
==========     ========
Apple              81
Banana             41
Mango             103
Orange             55

Only difference in real is that both the table contains 100s of millions of records.

I am trying to use the map reduce example from the Apache HBase Documentation here : http://hbase.apache.org/book/mapreduce.example.html

But I cant find a way to use two tables in Map Reduce.

What is the correct way of doing this ?

Any suggestion would be of great help at this point.

2

2 Answers

1
votes

Well, it is a 'join' problem here:

1/ if the product table is small, let's say less than 200MB, you can do a replicated join with an export of your table and use it in a mapper-only driver

2/ if both tables are really big, use a chained job: a job on sales group/count then use the output for the next job on products

3/ if both tables are really biiiiiiig, Hbase performs very well with flat data. So the most efficient way should be to have the products data inside the sales table. Denormalization is the key in Hadoop.

I suggest you to read this excellent book: MapReduce Design Patterns (http://shop.oreilly.com/product/0636920025122.do)

1
votes

Assuming both tables are keyed by the product code, you could do a merge join: Map over one table, and then scan the second table using the key from the first table (start row == key, and end row == key with the last byte incremented).

This can even work if you have compound keys, as long as the product code (the thing you want to join on) is the first part of the key in both tables.

Otherwise, @Treydone's advice is the way to go.