0
votes

We have a data stream that contains: time stamp, object ID, data The needed processing is:

for each new entry, fetch all entries with the same object ID, and do something with all data

One option is to use a separate queuing service. In this case, the HBASE schema can include the object id as a simple key, as all queries are based on the object ID.

The main drawback is the need to maintain an additional infrastructure.

Another option is to use a complex key of the form <object ID>.<time stamp>

I would also add a 'processed' boolean flag in the value, to indicate whether this record was already processed. (this flag can be either in the same column family as the other data, or separate)

The queries by user id should remain fast, as they query a sequential set of keys.

However, i'm not sure that querying by time range will also be fast in this case.

Will making the processed flag a separate column family, which will be deleted once processed increase the performance? (theoretically, only this column family needs to be scanned, and it will include only unprocessed records. on the other hand - having a column family for one flag may incur to much overhead)

Any other suggestions or refinements?

2

2 Answers

1
votes

How about having a 1 byte flag(0/1) prefixed to the key which will represent whether the record is processed or not? This way you can filter out processed records form the unprocessed ones in lesser time than having the flag stored as a separate column. You don't even need to look inside the rows. Just traversing across the rowkeys will give you the clear picture.

1
votes

I would not add the processed flag to the key - since it would mean you'd have to change the key (i.e. delete the old) once you processed a record. Deleted keys only get cleaned in major compactions.

If you are continuously querying for timestampss to get the latest unhandled records you can just store your timestamp in the HBase timestamp (each put can also include a timestamp) and then add that to the scan (it accepts start/end timestamps in addition to the key) - this would work quite fast for latest data as it is likely it is still in the memstores. Note that the timestamp is a binary field like any other so you can also add a prefix there if the record was processed or not (so processing a recod will create a new version with a different "timestamp") of the row