
Problem: I need to insert user ids into HBase every hour of every day (e.g. 2201201711 represents 22 Jan 2017, 11 AM). How should the table be designed if I want to fetch all user ids for a particular hour on a given date, or for a date and time range?

What I have done so far: I am keeping user ids as row keys and creating a column at run time in the same column family. File data:

user id | date time
1 | 2201201711
2 | 2201201711
3 | 2201201711

My HBase row keys would be 1, 2 and 3, and a new column 2201201711 would be created.

I know I can use a composite key made of the date, hour and user id, but I want to understand what benefit it provides in terms of performance.

What is the performance difference between selecting a whole column (without any filter) and looking up by composite row keys?


1 Answer


The solution can differ based on the amount of data you are going to put into this table and on how you will most often read it (Scan or Get).

My solution assumes that this table is going to be huge and that scans will be performed on it often:

The date-time part can be converted into epoch time and used as the row key of your table, while the user ids become column qualifiers. This way it is efficient to read a particular date-time range by setting startRow and endRow on the Scan. As far as I've seen, a scan performs better on a huge table this way because it skips the rows before the specified startRow and after the specified endRow.
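To make the epoch idea concrete, here is a minimal sketch of how the row keys and scan bounds could be built. It uses only the Python standard library and does not talk to HBase. The helper names `hour_to_row_key` and `scan_bounds` are hypothetical, and it assumes the stamps are in ddMMyyyyHH form and should be interpreted as UTC. The key is packed as 8 big-endian bytes so that byte order matches chronological order, which the raw ddMMyyyyHH string does not (for example "3101201723" sorts after "0102201700").

```python
import struct
from datetime import datetime, timedelta, timezone
from typing import Tuple

STAMP_FORMAT = "%d%m%Y%H"  # ddMMyyyyHH, e.g. "2201201711" (assumed input format)


def _parse(stamp: str) -> datetime:
    # Assumption: stamps are in UTC; adjust if the data uses a local time zone.
    return datetime.strptime(stamp, STAMP_FORMAT).replace(tzinfo=timezone.utc)


def hour_to_row_key(stamp: str) -> bytes:
    """Convert a ddMMyyyyHH stamp to an 8-byte big-endian epoch-seconds key."""
    return struct.pack(">q", int(_parse(stamp).timestamp()))


def scan_bounds(first_stamp: str, last_stamp: str) -> Tuple[bytes, bytes]:
    """Return (start_row, stop_row) covering first..last hour inclusive.

    The stop row is one hour past the last stamp because scan stop rows
    are conventionally exclusive.
    """
    stop = _parse(last_stamp) + timedelta(hours=1)
    return hour_to_row_key(first_stamp), struct.pack(">q", int(stop.timestamp()))


if __name__ == "__main__":
    start_row, stop_row = scan_bounds("2201201711", "2201201713")
    print(start_row.hex(), stop_row.hex())
```

The two byte strings would then be handed to whatever client performs the scan, for instance as the start and stop row of a `Scan` in the HBase Java client. Each row would hold one hour, with one column qualifier per user id, so a range read touches only the rows between the two bounds.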