Partial Row key scan in HBase

2

votes

I have to following row key in my hbase deployment:

EquipmentNumber|LogTime

for example: 454312|20180304124511

Now I want to do a partial row key scan i.e. I want scan only on LogTime range

for example I want to get all the equipment numbers between logtime t1 and t2?

Can any body please help.

hadoophbase

In Hbase (at least 1.1.x, which I used) you cannot effectively scan over postfix part of the row key, because you do the full scan. Let's consider to redesign your schema. – wind

0

votes

HBase rows are sorted in alphabetical order, so you can scan by prefix, but you can't scan by the suffix.

One thing you can do is to scan the whole table using RowFilter - write filter logic based on your field LogTime. It will filter rows on server, so your client code will get only correct rows.

0

votes

With filters full scan seems unavoidable unless you know approximate range of equipment ids that can fall within given duration which may not be always possible.

An alternate way of achieving this can be to use an intermediate lookup/index table which maps your second part of row key to the first part or to the composite row key in the actual data table. This will keeps your primary access pattern to find records by equipment id as is and when you want to look up by second key use the lookup table to find row keys in your data table and use those keys to fetch the required data. This approach however would put responsibility on your application to keep the lookup table in sync to update/delete in your data table.

For auto-management of indexes you can try Phoenix, you can create a Phoenix table with global index on logTime, here is some quick sample -

CREATE TABLE "SO50228751"(
"equipNum" integer not null,
"logtime" bigint not null,
"f"."data" varchar
CONSTRAINT pk PRIMARY KEY ("equipNum", "logtime"));

Add data

upsert into "SO50228751"("equipNum", "logtime","f"."data")  values(454312,20180304124511,'a');
upsert into "SO50228751"("equipNum", "logtime","f"."data")  values(454312,20180304124512,'b');
upsert into "SO50228751"("equipNum", "logtime","f"."data")  values(454312,20180304124513,'c');
upsert into "SO50228751"("equipNum", "logtime","f"."data")  values(454312,20180304124514,'d');
upsert into "SO50228751"("equipNum", "logtime","f"."data")  values(454312,20180304124515,'e');
upsert into "SO50228751"("equipNum", "logtime","f"."data")  values(454313,20180304124521,'f');
upsert into "SO50228751"("equipNum", "logtime","f"."data")  values(454313,20180304124522,'g');
upsert into "SO50228751"("equipNum", "logtime","f"."data")  values(454313,20180304124523,'h');
upsert into "SO50228751"("equipNum", "logtime","f"."data")  values(454313,20180304124524,'i');
upsert into "SO50228751"("equipNum", "logtime","f"."data")  values(454312,20180304124524,'ii');

Create Index

CREATE INDEX so_idx ON "SO50228751"(logtime);

Query by logTime using index

select /*+ /*+ INDEX("SO50228751" so_idx) */ * from "SO50228751" where "logtime" between 20180304124511 and 20180304124516;

Before you decide on choosing Phoenix indexes please do check the documentation and this link -> https://community.hortonworks.com/articles/61705/art-of-phoenix-secondary-indexes.html to understand how well they fit in your use case.

Hope this helps.

Partial Row key scan in HBase

2 Answers