I have a dataset in HBase that is large enough that a MapReduce job over the entire dataset takes a couple of hours. I'd like to be able to break the data down using precomputed indexes: once a day, map the entire dataset and split it into multiple indexes:
- A 1% sample of all users
- All users participating in a particular A/B experiment
- All users on the nightly prerelease channel
- All users with a particular addon (or whatever criterion we're interested in this week)
My thought was to store a list of row IDs for the relevant records, so that people can later run small MapReduce jobs over just those rows. But a 1% sample is still about a million rows, and I'm not sure how to construct a MapReduce job over a list of a million row IDs.
Does it make any sense to create a table mapper job using TableMapReduceUtil.initTableMapperJob(List<Scan> scans, ...) if the query is made up of a million separate Scan objects? Are there other ways to do this so that I can still farm the computation and I/O out to the HBase cluster efficiently?
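One direction I'm considering, to avoid creating a million Scan objects: sort the sampled row IDs and coalesce them into a much smaller number of [startRow, stopRow) ranges, then pass one Scan per range to initTableMapperJob and discard non-sampled rows inside each range in the mapper (or with a row filter). Here's a rough sketch of the coalescing step; it uses plain Strings for row keys, and the `ScanRanges` / `coalesce` names are just made up for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ScanRanges {
    /**
     * Partition a sorted list of row keys into at most maxScans
     * [startRow, stopRow) pairs. Each pair spans a contiguous run of
     * sampled keys; rows that fall inside a range but are not in the
     * sample would still be read, so the mapper must skip them.
     */
    static List<String[]> coalesce(List<String> sortedKeys, int maxScans) {
        List<String[]> ranges = new ArrayList<>();
        int n = sortedKeys.size();
        if (n == 0) {
            return ranges;
        }
        int perRange = (n + maxScans - 1) / maxScans; // ceil(n / maxScans)
        for (int i = 0; i < n; i += perRange) {
            int last = Math.min(i + perRange, n) - 1;
            // HBase stop rows are exclusive, so append a zero byte to
            // make the last sampled key in this chunk inclusive.
            ranges.add(new String[] {
                sortedKeys.get(i),
                sortedKeys.get(last) + "\0"
            });
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("a", "c", "f", "k", "m", "q", "z");
        for (String[] r : coalesce(sample, 3)) {
            System.out.println(r[0] + " .. " + r[1]);
        }
    }
}
```

Each resulting pair would become one Scan (setStartRow/setStopRow) handed to TableMapReduceUtil.initTableMapperJob(List<Scan>, ...). I gather newer HBase versions also ship a MultiRowRangeFilter that does something similar server-side, but I'm not sure whether it's available in the version we run.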