I am working on a project where I have to store user-activity events per user on a daily basis for later analysis. I will be receiving a stream of timestamped events and will later run Dataflow jobs on this data to compute per-user stats. I am exploring Bigtable to store this data, with the timestamp acting as the row key, so that I can later run a range query to fetch a single day's data and process it. But after going through a couple of resources I learned that with timestamp-based row keys Bigtable can run into a hotspotting problem, and I can't promote the user ID in the row key to avoid this. Is there an alternative approach to solve this, or any other storage engine that would help in this use case? A rough sketch of what I currently have in mind is below.
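To make the current design concrete, here is a minimal sketch of the timestamp-keyed layout and the daily range scan I described, assuming the Python google-cloud-bigtable client; the project/instance/table names, the "events" column family, and the helper functions are made up for illustration:

```python
from google.cloud import bigtable
from google.cloud.bigtable.row_set import RowSet

# Hypothetical project/instance/table names; the "events" column family
# is assumed to already exist on the table.
client = bigtable.Client(project="my-project")
instance = client.instance("my-instance")
table = instance.table("raw-events")

def write_event(event):
    # Row key is just the event timestamp in milliseconds. Sequential keys
    # like this land on adjacent tablets, which is the hotspotting concern.
    row_key = str(event["timestamp_ms"]).encode()
    row = table.direct_row(row_key)
    row.set_cell("events", event["user_id"].encode(), event["payload"].encode())
    row.commit()

def read_day(day_start_ms, day_end_ms):
    # Daily range scan: all rows whose key falls between the two timestamps.
    row_set = RowSet()
    row_set.add_row_range_from_keys(
        start_key=str(day_start_ms).encode(),
        end_key=str(day_end_ms).encode(),
    )
    return table.read_rows(row_set=row_set)
```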
Use case: I have user-activity data (impressions and clicks) arriving in streams. Based on rules, I have to aggregate data from these streams over a certain duration, store it, and serve it to an upstream service as soon as possible. The data will be processed in tumbling windows, currently 24 hours, but the window may grow or shrink. The choices I have to make are: how to store the raw events (Bigtable, BigQuery, or direct analysis on the streams), which compute engine to use (Beam vs. aggregation queries), and the final storage (keyed by user ID). The relation between a user and the aggregated data is one-to-many. A rough sketch of the Beam option I am picturing is below.
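For the aggregation side, this is roughly what I picture the Beam option looking like: a minimal sketch with made-up sample events, keying per user and event type and counting over a 24-hour tumbling window (all names and tuples are hypothetical):

```python
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

# Made-up sample events: (user_id, event_type, unix_seconds).
events = [
    ("user-1", "impression", 1_700_000_000),
    ("user-1", "click", 1_700_000_120),
    ("user-2", "impression", 1_700_003_600),
]

with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.Create(events)
        # Attach the event time so windowing uses it instead of processing time.
        | "Stamp" >> beam.Map(lambda e: TimestampedValue(e, e[2]))
        # Tumbling window; 24 h today, but the size is a single constant.
        | "Window" >> beam.WindowInto(FixedWindows(24 * 60 * 60))
        # One count per (user, event type) per window.
        | "KeyByUser" >> beam.Map(lambda e: ((e[0], e[1]), 1))
        | "Count" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```

In a real pipeline the `beam.Create` step would be replaced by the streaming source, and the final step would write the per-user aggregates to whatever final storage is chosen.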