0 votes

I have a table with hundreds of millions of records. This table contains data about servers and events generated on them. The following is the row key of the table:

rowkey = md5(serverId) + timestamp [32 hex characters + 10 digits = 42 characters]
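
For context, a minimal sketch of how such a key could be built (the class and method names, and the assumption that the 10-digit part is epoch seconds, are mine, not from the question):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class EventRowKey {

    // Builds md5(serverId) as 32 hex characters followed by a 10-digit,
    // zero-padded timestamp (42 characters total, matching the layout above).
    public static byte[] generateKey(String serverId, long epochSeconds) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(serverId.getBytes(StandardCharsets.UTF_8));
        StringBuilder key = new StringBuilder(42);
        for (byte b : digest) {
            key.append(String.format("%02x", b));           // 32 hex characters
        }
        key.append(String.format("%010d", epochSeconds));   // 10-digit timestamp, assumed epoch seconds
        return key.toString().getBytes(StandardCharsets.UTF_8);
    }
}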

One of the use cases is to list all the events from time t1 to t2. For this, a normal scan takes too much time. To speed things up, I have done the following:

  1. Fetch the list of unique serverIds from another table (really fast).
  2. Divide the above list into 256 buckets based on the first two hex characters of the md5 of the serverIds (see the sketch after this list).
  3. For each bucket, call a co-processor (parallel requests) with the list of serverIds, a start time, and an end time.
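
A rough sketch of the bucketing in step 2 (the class and method names are illustrative, not from the question):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ServerIdBuckets {

    // Groups serverIds into up to 256 buckets keyed by the first two hex
    // characters ("00" .. "ff") of md5(serverId).
    public static Map<String, List<String>> bucketByMd5Prefix(List<String> serverIds) throws Exception {
        Map<String, List<String>> buckets = new HashMap<>();
        for (String serverId : serverIds) {
            String prefix = md5Hex(serverId).substring(0, 2);
            buckets.computeIfAbsent(prefix, k -> new ArrayList<>()).add(serverId);
        }
        // Each bucket is then sent to the co-processor in its own parallel
        // request together with the start and end time.
        return buckets;
    }

    private static String md5Hex(String s) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder(32);
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}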

The co-processor scans the table as follows:

for (String serverId : serverIds) {
  // Scan only the row-key range for this server between startTime and endTime
  byte[] startKey = generateKey(serverId, startTime);
  byte[] endKey = generateKey(serverId, endTime);
  Scan scan = new Scan(startKey, endKey);
  InternalScanner scanner = env.getRegion().getScanner(scan);
  ....
}

I am able to get results quite fast with this approach. My only concern is the large number of scans. If the table has 20,000 serverIds, then the above code makes 20,000 scans. Will it impact the overall performance and scalability of HBase?

The answer below suggests a timestamp filter, but that requires a cell-level scan; your solution only uses row keys and will be much faster. – halil
The solution I have described in the question is fast, and I am satisfied with the performance. My question is about the long-term effect on HBase, considering the number of scans on the server. – Ravi Singal
Yes, it affects performance when the number of serverIds increases. – halil

1 Answer

0 votes

Try using a timestamp filter. The following is the syntax to test it in the HBase shell:

import java.util.ArrayList
import org.apache.hadoop.hbase.filter.TimestampsFilter
list = ArrayList.new()
list.add(1444398443674) # START TIMESTAMP
list.add(1444457737937) # END TIMESTAMP
scan 'eventLogTable', {FILTER => TimestampsFilter.new(list)}

The same API exists in Java and other languages too.
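
For instance, a rough Java equivalent of the shell scan above could look like this (connection setup is a minimal sketch; the table name and timestamps are copied from the shell example):

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.TimestampsFilter;

public class TimestampScanExample {
    public static void main(String[] args) throws IOException {
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("eventLogTable"))) {

            // Keeps only cells whose timestamps exactly match the listed values,
            // same as the shell example above.
            Scan scan = new Scan();
            scan.setFilter(new TimestampsFilter(Arrays.asList(1444398443674L, 1444457737937L)));

            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    System.out.println(result);
                }
            }
        }
    }
}

Note that the filter is applied server-side while the scan still walks the cells of the table, which is why the comment on the question calls this a cell-level scan, whereas the row-key ranges in the question narrow the scan before any cells are read.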