
I am new to HBase. I have a main table with rowkey = id-YYYYMMDD and a secondary index table with rowkey = YYYYMMDD-id, plus a column that holds the corresponding rowkey in the main table. I will have about 1 million ids in the near future, and I need to create a MapReduce job that summarizes the ids for a given date (YYYYMMDD).

How do I pass the secondary index table to the MapReduce job so that the corresponding "get(rowkey)" calls are run against the main table to fetch the columns and summarize the data?


1 Answer


You have 2 options:

  1. Scan the index table first. Set the scan's startRow and stopRow to bracket the date (e.g. '20190401' and '20190402'; stopRow is exclusive), so it reads one contiguous range of the key space and collects the main-table row keys stored in the index. This costs O(M), where M is the number of index rows for that date. Then fetch each of those rows from the main table with a Get.
  2. Since the date is part of the main table's row key, you can skip the index and run a MapReduce scan over the main table with a row-key filter that keeps only keys ending in the date. Because the date is a suffix rather than a prefix of the key, this is a full table scan: it runs in O(N/P), where N is the total number of rows in the table and P is the parallelism of your cluster.
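
To make the difference between the two options concrete, here is a minimal sketch that does not use the HBase client at all. It assumes a `java.util.TreeMap` as a stand-in for HBase's sorted row-key space. The class name, method names, and sample rows are all made up for illustration. In HBase itself, the `subMap` call would roughly correspond to a `Scan` with a start and stop row, and `main.get` to a `Get`; that mapping is not exercised here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in: each TreeMap plays the role of one HBase table (row key -> value).
final class IndexLookupSketch {

    // Option 1: range-scan the index for one day, then look up each referenced main-table row.
    static List<String> viaIndex(TreeMap<String, String> index,
                                 TreeMap<String, String> main,
                                 String day, String nextDay) {
        List<String> values = new ArrayList<>();
        // subMap is start-inclusive and stop-exclusive, like a scan from startRow to stopRow.
        for (String mainKey : index.subMap(day, nextDay).values()) {
            String value = main.get(mainKey); // point lookup, playing the role of a Get
            if (value != null) {
                values.add(value);
            }
        }
        return values;
    }

    // Option 2: walk every main-table row and keep those whose key ends with "-<day>".
    static List<String> viaFullScan(TreeMap<String, String> main, String day) {
        List<String> values = new ArrayList<>();
        String suffix = "-" + day;
        for (Map.Entry<String, String> row : main.entrySet()) {
            if (row.getKey().endsWith(suffix)) {
                values.add(row.getValue());
            }
        }
        return values;
    }

    // Made-up sample rows keyed as id-YYYYMMDD.
    static TreeMap<String, String> sampleMain() {
        TreeMap<String, String> main = new TreeMap<>();
        main.put("id1-20190331", "a");
        main.put("id1-20190401", "b");
        main.put("id2-20190401", "c");
        main.put("id2-20190402", "d");
        return main;
    }

    // Derive the index (YYYYMMDD-id -> main row key) from the main table's keys.
    static TreeMap<String, String> buildIndex(TreeMap<String, String> main) {
        TreeMap<String, String> index = new TreeMap<>();
        for (String mainKey : main.keySet()) {
            int dash = mainKey.lastIndexOf('-');
            String id = mainKey.substring(0, dash);
            String day = mainKey.substring(dash + 1);
            index.put(day + "-" + id, mainKey);
        }
        return index;
    }

    public static void main(String[] args) {
        TreeMap<String, String> main = sampleMain();
        TreeMap<String, String> index = buildIndex(main);
        System.out.println(viaIndex(index, main, "20190401", "20190402")); // prints [b, c]
        System.out.println(viaFullScan(main, "20190401"));                 // prints [b, c]
    }
}
```

Both methods return the same rows. The first touches only the index entries for that day plus one lookup per entry, while the second visits every row in the main table.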