How should I design my HBASE schema for this time series scenario?

Question

We have a data stream that contains: time stamp, object ID, data. The required processing is:

For each new entry, fetch all entries with the same object ID, and do something with all of their data.

One option is to use a separate queuing service. In this case, the HBase schema can use the object ID as a simple row key, since all queries are based on the object ID.

The main drawback is the need to maintain additional infrastructure.

Another option is to use a complex key of the form <object ID>.<time stamp>

I would also add a 'processed' boolean flag in the value to indicate whether the record has already been processed. (This flag can be either in the same column family as the other data or in a separate one.)

Queries by object ID should remain fast, as they read a sequential set of keys.

However, I'm not sure that querying by time range will also be fast in this case.
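To make the trade-off concrete, here is a minimal sketch in plain Python (no HBase client involved). A sorted list of byte strings stands in for the table, since HBase orders rows lexicographically by row key. The helper names `make_row_key`, `prefix_scan` and `time_range_scan`, the `.` separator, and the 8-byte big-endian timestamp encoding are all assumptions made for illustration.

```python
import bisect
import struct


def make_row_key(object_id: str, timestamp_ms: int) -> bytes:
    """Build <object ID>.<time stamp>; a fixed-width big-endian timestamp
    keeps byte order equal to chronological order within one object."""
    return object_id.encode("utf-8") + b"." + struct.pack(">Q", timestamp_ms)


def prefix_scan(sorted_keys, prefix: bytes):
    """Stand-in for a prefix scan: seek to the prefix, read the contiguous run."""
    start = bisect.bisect_left(sorted_keys, prefix)
    end = start
    while end < len(sorted_keys) and sorted_keys[end].startswith(prefix):
        end += 1
    return sorted_keys[start:end]


def time_range_scan(sorted_keys, start_ms: int, end_ms: int):
    """Time range across all objects: the timestamp is not the leading part
    of the key, so every key has to be inspected."""
    hits = []
    for key in sorted_keys:
        ts = struct.unpack(">Q", key[-8:])[0]
        if start_ms <= ts < end_ms:
            hits.append(key)
    return hits


# Hypothetical sample data: (object ID, timestamp in ms)
events = [("obj-b", 1000), ("obj-a", 3000), ("obj-a", 1000),
          ("obj-c", 2000), ("obj-a", 2000)]
table = sorted(make_row_key(o, t) for o, t in events)

obj_a_rows = prefix_scan(table, b"obj-a.")     # touches only obj-a's rows
in_range = time_range_scan(table, 1500, 2500)  # walks the whole table
```

In this model the per-object lookup reads one contiguous block of keys, while the cross-object time range has to visit every row, which is the concern raised above.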

Will making the processed flag a separate column family, which is deleted once the record is processed, increase performance? (Theoretically, only this column family needs to be scanned, and it will include only unprocessed records. On the other hand, having a column family for a single flag may incur too much overhead.)

Any other suggestions or refinements?

Tariq Tariq · Accepted Answer · 2013-11-27T20:36:16

How about having a 1-byte flag (0/1) prefixed to the key to represent whether the record has been processed or not? This way you can separate processed records from unprocessed ones in less time than with the flag stored as a separate column. You don't even need to look inside the rows. Just traversing the rowkeys will give you a clear picture.
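A rough sketch of the flag-prefixed key, again in plain Python with a dict standing in for the table. The names `make_flagged_key`, `scan_prefix` and `mark_processed`, the byte values chosen for the flag, and the rest of the key layout are assumptions for illustration, not part of the suggestion itself.

```python
import struct

UNPROCESSED = b"\x00"  # assumed flag values; any two distinct bytes would do
PROCESSED = b"\x01"


def make_flagged_key(flag: bytes, object_id: str, timestamp_ms: int) -> bytes:
    """Build <flag><object ID>.<time stamp> with a 1-byte leading flag."""
    return flag + object_id.encode("utf-8") + b"." + struct.pack(">Q", timestamp_ms)


def scan_prefix(table: dict, prefix: bytes):
    """Stand-in for a prefix scan over lexicographically sorted row keys."""
    return [key for key in sorted(table) if key.startswith(prefix)]


def mark_processed(table: dict, key: bytes) -> bytes:
    """A row key cannot be edited in place, so the row is re-inserted
    under the 'processed' key and the old row is removed."""
    new_key = PROCESSED + key[1:]
    table[new_key] = table.pop(key)
    return new_key


# Hypothetical sample rows: row key -> value
table = {
    make_flagged_key(UNPROCESSED, "obj-a", 1000): b"data-1",
    make_flagged_key(UNPROCESSED, "obj-b", 1500): b"data-2",
    make_flagged_key(PROCESSED, "obj-a", 500): b"data-0",
}

pending_before = scan_prefix(table, UNPROCESSED)  # only unprocessed rows
mark_processed(table, pending_before[0])
pending_after = scan_prefix(table, UNPROCESSED)
```

One consequence of this layout is that changing the flag means writing the row under a new key and deleting the old one, and fetching all entries for an object ID takes one prefix scan per flag value.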
