We have thousands of sensors that produce measurement time series data, which we want to store in Cassandra. We currently store ~500 million records per day, and this volume will grow by a factor of 5-10 in the near future.
We mostly work with the most recent measurement data. Old measurement data is barely read.
- We mostly read the most recent measurements (i.e., at most one week old),
- older measurements (i.e., less than a month old) are only rarely read (ten times per week),
- very old measurements (i.e., 1-6 months old) are very rarely read (once per month),
- measurements older than 6 months are assumed to be cold, i.e., never read.
As our compaction strategy, we use DTCS. Setting a TTL is not an option, because we need to keep the measurement data for archiving purposes.
I am not sure yet how to deal with the fact that "old data is almost cold".
Update: What I want to avoid is having 20 TB in my Cassandra cluster, of which 18 TB are read, let's say, only once a year, if at all. I don't want to pay for 18 TB that are not needed. Setting a TTL is not an option because we should still be able to read data from, e.g., March 2013 (additional cost for such a request is OK). If we set a TTL of, e.g., 6 months, then we cannot do that properly.
We are currently evaluating two design alternatives, and are looking for the most cost-effective one:
1. One keyspace, with partition key (sensor_id, measurement_date)
2. One keyspace per month, with the same partition key (sensor_id, measurement_date)
(in both cases, we will have at most 500K columns per row, mostly less than 100K)
The disadvantage of option 2 is that we will have fewer than 100 keyspaces instead of 1, and reading the data becomes more complex. The advantage of option 2 is that we can snapshot/backup/delete/restore the keyspaces on a monthly basis, which, from my understanding, cannot easily be done if we go with option 1. This way, we don't have to size our Cassandra cluster to hold terabytes of data that is actually cold.
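To illustrate the extra read complexity with one keyspace per month, here is a minimal sketch of how a client might work out which keyspaces a date-range query has to touch. The function name `keyspaces_for_range` and the naming scheme `measurements_YYYY_MM` are purely hypothetical and not something we have settled on.

```python
from datetime import date
from typing import List


def keyspaces_for_range(start: date, end: date, prefix: str = "measurements") -> List[str]:
    """Return the monthly keyspace names covering [start, end], inclusive.

    The naming scheme "<prefix>_YYYY_MM" is an assumption for illustration.
    """
    if start > end:
        raise ValueError("start must not be after end")

    names = []
    year, month = start.year, start.month
    # Walk month by month until we pass the month containing `end`.
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}_{year:04d}_{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names
```

A query for a single week would usually hit one keyspace (two if the week spans a month boundary), whereas a query reaching back to archived months would first require restoring the corresponding keyspaces.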
My question: Is option 2 a reasonable choice for our use case, or is it considered an anti-pattern in Cassandra?
Thank you for your help!