
I have an HBase table whose row key has the form <<timestamp>>_<<user_id>>, where the timestamp is formatted as yyyyMMddHHmm. I need to query user details within a given time range.

e.g. "201602021310_user1"

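HTable table = new HTable(conf, tableName);
Scan s = new Scan();
s.setStartRow(Bytes.toBytes("20160202")); // start row is inclusive
s.setStopRow(Bytes.toBytes("20160303"));  // stop row is exclusive
List<Result> rs = new ArrayList<Result>();
ResultScanner ss = table.getScanner(s);
try {
    for (Result r : ss) {
        rs.add(r);
    }
} finally {
    ss.close(); // release the scanner's server-side resources
}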

As I understand it, the scan itself will work because HBase stores rows in lexicographic order. However, since the key starts with a timestamp, all writes land on the same region, which causes region server hotspotting. To avoid hotspotting, I am considering the following (comments welcome):

  1. Use a hashed prefix in the row key. If I do that, I think my range scan will no longer work the way I want.
  2. Then use a filter such as a fuzzy row filter. But I couldn't find a way to do range queries with it. As I understand it, the best I can do is run one fuzzy filter for each full month plus one for each remaining day, and merge the results: 201602??_?????? + 20160301_?????? + 20160302_?????? + 20160303_??????

What is the best approach to achieve this (eliminating hotspotting while still supporting range queries)?


1 Answer

row_key = (++index % BUCKETS_NUMBER) + original_key

Where,

  • index - The numeric (or any sequential) part of the specific record/row ID.
  • BUCKETS_NUMBER - the number of “buckets” we want our new row keys to be spread across.
  • original_key - The original key of the record we want to write.
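
As a rough illustration of the formula above, here is a minimal sketch in plain Java. The class name, the bucket count of 4, and the two-digit text prefix followed by an underscore are all assumptions made for readability; they are not taken from the blog post or from the HBaseWD library, which may encode the prefix differently.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: builds bucketed row keys as (++index % BUCKETS_NUMBER) + original_key.
class BucketedKeyWriter {
    // Assumed bucket count; in practice it would be chosen to match the cluster layout.
    static final int BUCKETS_NUMBER = 4;

    // Sequential counter standing in for "index" in the formula.
    private final AtomicLong index = new AtomicLong();

    // Returns "<bucket>_<original_key>", with the bucket zero-padded to two digits
    // so that all prefixes have the same length and sort consistently.
    public String nextRowKey(String originalKey) {
        long bucket = index.incrementAndGet() % BUCKETS_NUMBER;
        return String.format("%02d_%s", bucket, originalKey);
    }
}
```

With these assumptions, consecutive writes of timestamp-ordered keys rotate through the prefixes 01, 02, 03, 00, so they no longer all target the same key range.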

New row keys of bucketed records will no longer be in one sequence, but records in each bucket will preserve their original sequence. Since data is placed in multiple buckets during writes, we have to read from all of those buckets when doing scans based on “original” start and stop keys and merge data so that it preserves the “sorted” attribute. Scan per bucket can be parallelized and so the performance won’t be degraded.
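
The read side described above can be sketched as follows. This is a simulation in plain Java, not HBase client code: a TreeMap stands in for the sorted table, and the class name, the bucket count, and the "NN_" text prefix are assumptions for illustration. For ASCII keys like these, String ordering matches HBase's byte-wise lexicographic ordering. Against a real table, each call to scan would be a separate Scan with the prefixed start and stop rows, and the per-bucket scans could run in parallel.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch: one range scan per bucket, then a merge back into original-key order.
class BucketedRangeScan {
    static final int BUCKETS_NUMBER = 4; // assumed; must match the value used when writing

    static String prefix(int bucket) {
        return String.format("%02d_", bucket);
    }

    // Stand-in for a single HBase range scan: keys in [start, stop), in sorted order.
    static List<String> scan(TreeMap<String, String> table, String start, String stop) {
        return new ArrayList<>(table.subMap(start, true, stop, false).keySet());
    }

    // Scans every bucket with the prefixed range, strips the prefix, and re-sorts
    // so the combined result is ordered by the original key.
    static List<String> scanAllBuckets(TreeMap<String, String> table, String start, String stop) {
        List<String> originalKeys = new ArrayList<>();
        for (int b = 0; b < BUCKETS_NUMBER; b++) {
            String p = prefix(b);
            for (String rowKey : scan(table, p + start, p + stop)) {
                originalKeys.add(rowKey.substring(p.length()));
            }
        }
        originalKeys.sort(Comparator.naturalOrder());
        return originalKeys;
    }
}
```

Calling `scanAllBuckets(table, "20160202", "20160303")` would then cover the same time range as the unprefixed scan in the question, at the cost of one scan per bucket.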

Extracted from the Sematext blog post "HBaseWD: Avoid RegionServer Hotspotting Despite Sequential Keys".

Read the full post for a complete explanation.