We have a data stream in which each entry contains a time stamp, an object ID, and data.
The required processing is:
for each new entry, fetch all entries with the same object ID, and do something with all of their data.
One option is to use a separate queuing service. In this case, the HBase schema can use the object ID as a simple key, as all queries are based on the object ID.
The main drawback is the need to maintain additional infrastructure.
Another option is to use a complex key of the form <object ID>.<time stamp>
I would also add a 'processed' boolean flag in the value, to indicate whether this record was already processed. (This flag can be either in the same column family as the other data, or in a separate one.)
The queries by object ID should remain fast, as all rows for one object ID form a contiguous range of keys.
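To make the key layout concrete, here is a minimal sketch in plain Python (not the HBase client API). It models the table as a sorted list of row keys, which mirrors the fact that HBase stores rows in lexicographic order of their keys. The names `make_row_key` and `prefix_scan`, the 13-digit millisecond time stamp, and the assumption that object IDs never contain a '.' are all mine, for illustration only.

```python
from bisect import bisect_left

def make_row_key(object_id: str, timestamp_ms: int) -> bytes:
    # Zero-pad the time stamp so that byte order matches numeric order.
    # Assumes object IDs contain no '.' (hypothetical constraint).
    return f"{object_id}.{timestamp_ms:013d}".encode("utf-8")

def prefix_scan(sorted_keys, object_id):
    # Jump to the first key with the prefix, then read until the prefix ends.
    prefix = f"{object_id}.".encode("utf-8")
    start = bisect_left(sorted_keys, prefix)
    rows = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        rows.append(key)
    return rows

# Emulate the sorted row-key order of a table.
keys = sorted([
    make_row_key("obj1", 1000),
    make_row_key("obj2", 1500),
    make_row_key("obj1", 2000),
    make_row_key("obj10", 1200),
])

obj1_rows = prefix_scan(keys, "obj1")
print(obj1_rows)
```

Including the '.' delimiter in the prefix keeps "obj1" from also matching "obj10", and the fixed-width time stamp keeps the rows of one object in chronological order.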
However, I'm not sure that querying by time range will also be fast in this case.
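A rough sketch of why I have this doubt, again in plain Python rather than the HBase API: with keys of the form <object ID>.<time stamp>, the time stamp is not the leading part of the key, so a time-range query across all objects cannot be narrowed to one contiguous key range. The helper names and the 13-digit time stamp width are assumptions for illustration, and the model only covers row-key ordering, not any other filtering HBase may offer.

```python
def make_row_key(object_id: str, timestamp_ms: int) -> bytes:
    # Same hypothetical layout: <object ID>.<zero-padded time stamp>
    return f"{object_id}.{timestamp_ms:013d}".encode("utf-8")

def parse_row_key(key: bytes):
    object_id, ts = key.decode("utf-8").rsplit(".", 1)
    return object_id, int(ts)

def time_range_scan(sorted_keys, start_ms, end_ms):
    # Matching rows are scattered across object IDs,
    # so every key has to be examined.
    examined = 0
    matches = []
    for key in sorted_keys:
        examined += 1
        _, ts = parse_row_key(key)
        if start_ms <= ts < end_ms:
            matches.append(key)
    return matches, examined

keys = sorted([
    make_row_key("a", 1000),
    make_row_key("a", 5000),
    make_row_key("b", 1200),
    make_row_key("b", 9000),
    make_row_key("c", 4000),
])

matches, examined = time_range_scan(keys, 1000, 2000)
print(matches, examined)
```

In this model the two matching rows belong to different objects and are not adjacent in key order, so the scan touches all five keys to find them.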
Would making the processed flag a separate column family, which is deleted once the record is processed, improve performance? (Theoretically, only this column family would need to be scanned, and it would contain only unprocessed records. On the other hand, having a column family for a single flag may incur too much overhead.)
Any other suggestions or refinements?