0
votes

I have a dataset in hbase which is large enough that it takes a couple hours to run a mapreduce job on the entire dataset. I'd like to be able to break down the data using precomputed indexes: once a day map the entire data set and break it down into multiple indexes:

  • 1% sample of all users
  • All users who are participating in a particular A/B experiment
  • All users on the nightly prerelease channel.
  • All users with a paticular addon (or whatever criterion we're interested in this week)

My thought was to just store a list of row IDs for the relevant records, and then later people can do little mapreduce jobs on just those rows. But a 1% sample is still 1M rows of data, and I'm not sure how to construct a mapreduce job on a list of a million rows.

Does it make any sense to create a table mapper job using initTableMapperJob(List scans) if there are going to be a million different Scan objects which make up the query? Are there other ways to do this so that I can still farm out the computation and I/O to the hbase cluster efficiently?

1

1 Answers

1
votes

Don't do a million scans. If you have a million non-contiguous ids, you could run a map/reduce job over the list of ids using a custom input format so that you divide the list up into a reasonable number of partitions (I would guess 4x the number of your m/r slots, but that number is not based on anything). That would give you a million get operations, which is probably better than a million scans.

If you are lucky enough to have a more reasonable number of contiguous ranges, then scans would be better than straight gets