I have a table with hundreds of millions of records. This table contains data about servers and events generated on them. The row key of the table is:
rowkey = md5(serverId) + timestamp [32 hex characters + 10 digits = 42 characters]
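For concreteness, a key built to this layout can be sketched as below. This is only an assumed implementation of the `generateKey` helper used in the scan code later: the helper name comes from the post, but the choice of epoch seconds for the 10-digit timestamp and the exact encoding are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class RowKeys {
    // Hypothetical sketch of generateKey: 32 lowercase hex characters of
    // md5(serverId) followed by a 10-digit zero-padded timestamp, giving
    // the 42-character key described above.
    public static byte[] generateKey(String serverId, long epochSeconds)
            throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(serverId.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder(42);
        for (byte b : digest) {
            sb.append(String.format("%02x", b & 0xFF)); // hex-encode the digest
        }
        sb.append(String.format("%010d", epochSeconds)); // zero-pad to 10 digits
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }
}
```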
One of the use cases is to list all events from time t1 to t2. A normal scan takes too long for this, so to speed things up I have done the following:
- Fetch the list of unique serverIds from another table (very fast).
- Divide the above list into 256 buckets based on the first two hex characters of the md5 of each serverId.
- For each bucket, call a co-processor (parallel requests) with the list of serverIds, the start time, and the end time.
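The bucketing step above can be sketched as follows. This is a minimal sketch; the class and method names are hypothetical, and it assumes the bucket key is simply the first two hex characters of the md5 digest:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Buckets {
    // Group serverIds into up to 256 buckets keyed by the first two hex
    // characters of md5(serverId). Because the row key starts with the same
    // md5, each bucket corresponds to one contiguous key-prefix range.
    public static Map<String, List<String>> bucketize(List<String> serverIds)
            throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        Map<String, List<String>> buckets = new HashMap<>();
        for (String id : serverIds) {
            byte[] digest = md.digest(id.getBytes(StandardCharsets.UTF_8));
            String prefix = String.format("%02x", digest[0] & 0xFF);
            buckets.computeIfAbsent(prefix, k -> new ArrayList<>()).add(id);
        }
        return buckets;
    }
}
```

Each bucket's request can then be sent in parallel to the region(s) holding that key prefix.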
The co-processor scans the table as follows:

for (String serverId : serverIds) {
    byte[] startKey = generateKey(serverId, startTime);
    byte[] endKey = generateKey(serverId, endTime);
    Scan scan = new Scan(startKey, endKey);
    InternalScanner scanner = env.getRegion().getScanner(scan);
    // ... drain the scanner and collect the results
}
I am able to get the results quite fast with this approach. My only concern is the large number of scans: if the table has 20,000 serverIds, then the above code issues 20,000 scans. Will this impact the overall performance and scalability of HBase?