3 votes

We have thousands of sensors that produce measurement time series data we want to store in Cassandra. We are currently storing ~500 million records per day, and this volume will grow by a factor of 5-10 in the near future.

We mostly work with the most recent measurement data. Old measurement data is barely read.

  • We mostly read the most recent measurements (i.e., up to one week old),
  • older measurements (i.e., less than a month old) are only rarely read (about ten times per week),
  • very old measurements (i.e., 1-6 months old) are read very rarely (about once per month),
  • measurements older than 6 months are assumed to be cold, i.e., never read.

As compaction strategy, we use DTCS (DateTieredCompactionStrategy). Setting a TTL is not an option, because we need to keep the measurement data for archiving purposes.

I am not sure yet how to deal with the fact that "old data is almost cold".

Update: What I want to avoid: having 20 TB in my Cassandra cluster where 18 TB are read, let's say, only once a year, if at all. I don't want to pay for 18 TB that are not needed. Setting a TTL is not an option because we should still be able to read data from, e.g., March 2013 (additional cost for such a request is OK). If we set a TTL of, e.g., 6 months, we cannot do that.

We are currently evaluating two design alternatives (sketched below) and are looking for the most cost-effective one:

  1. One keyspace, with partition key (sensor_id, measurement_date)
  2. One keyspace per month, with the same partition key (sensor_id, measurement_date)

(in both cases, we will have at most 500K columns per row, mostly less than 100K)
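For illustration, here is a rough sketch of what option 1 could look like (the table/keyspace names and the clustering and value columns are placeholders, not our exact schema); option 2 would use the same table definition created once per monthly keyspace, e.g. measurements_2016_04:

    CREATE TABLE measurements (
        sensor_id         text,
        measurement_date  date,       -- day bucket, part of the partition key
        measured_at       timestamp,  -- placeholder clustering column
        value             double,     -- placeholder value column
        PRIMARY KEY ((sensor_id, measurement_date), measured_at)
    );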

The disadvantage of 2. is that we will have <100 keyspaces instead of 1, and reading the data becomes more complex. The advantage of 2. is that we can snapshot/backup/delete/restore keyspaces on a monthly basis, which - from my understanding - cannot easily be done if we go with option 1. This way, we don't have to size our Cassandra cluster to hold terabytes of data that are actually cold.

My question: Is 2. a reasonable option for our use case, or is this considered an anti-pattern in Cassandra?

Thank you for your help!

> I have up to 500k measurements per sensor per day. I wonder what kind of compression you use. 500 million total / 500k per sensor = 1000 - that is only 1000 sensors at the moment. Let's assume 10 metrics per sensor. So you may not need a large namespace for keys if you can compress each series. Is this digital data (0/1) or analog, and what kind of variance is observed? Do you need to keep precise values (bigdecimal)? - Sergei Rodionov

2 Answers

3 votes

Generally you wouldn't want to have older cold data in a separate keyspace, as this would get difficult to maintain (as you mentioned). Right now your challenge appears to be very wide rows due to how you are partitioning the data. Instead I would propose you "bucket" the data by month. This can be done by modifying your partition key like so:

PRIMARY KEY ((year,month,sensor_id), measurement_date)

The extra parentheses are CQL syntax for declaring multiple columns as the partition key. This means you will always have to provide a year, month, and sensor_id to read from this table. However, remember that in Cassandra, primary keys (unlike in relational databases) define how your data is distributed across the cluster. So effectively what we are doing is bucketing sensor data by year-month into its own partition. Hence we're basically achieving what you were thinking of with multiple keyspaces, but in a much more Cassandra- and developer-friendly way.
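As a rough sketch (the value column and the exact types are assumptions on my part, since you haven't shown your full table definition), the table could look like this:

    CREATE TABLE time_series (
        year              int,
        month             int,
        sensor_id         text,
        measurement_date  timeuuid,
        value             double,    -- assumed measurement value column
        PRIMARY KEY ((year, month, sensor_id), measurement_date)
    );

    -- Reads then have to supply the full partition key, e.g.:
    SELECT * FROM time_series
     WHERE year = 2016 AND month = 4 AND sensor_id = 'sensor-1';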

Inserting data into this table will be pretty easy. Assuming that measurement_date is a timeuuid (it should be, otherwise you may be overwriting data), here's the general flow your code would follow:

  1. Generate a timeuuid (UUIDv1) for the current time
  2. From the timeuuid get the year and month parts
  3. Then execute your CQL for INSERT:

    • INSERT INTO time_series (year, month, sensor_id, measurement_date) VALUES (2016, 4, 'sensor_id', <generated timeuuid>);

Like I mentioned before, reading data from the table should be pretty straightforward. If you want some more info, I have a much longer response related to your data modeling question here.

Since you're writing up to 500K measurements per sensor per day, you will want to further bucket this data (see the above SO answer for extra details; one possibility is sketched below), since generally C* starts to perform poorly once the number of clustering values in a partition goes over the 10k mark.
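Just as an illustration (the extra day component is one possible choice on my part, not the only option), that further bucketing could look like:

    PRIMARY KEY ((year, month, day, sensor_id), measurement_date)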

Finally, you might want to read Optimizing Cold SSTables since it provides some good information. For example, you can tune cold_reads_to_omit so you will not waste time compacting very cold tables. For DTCS you can set max_sstable_age_days to stop compacting SSTables past a certain age to save IO on cold tables.
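For example (the 30-day value is arbitrary and just for illustration), the DTCS option can be set per table:

    ALTER TABLE time_series
    WITH compaction = { 'class': 'DateTieredCompactionStrategy',
                        'max_sstable_age_days': 30 };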

Update: Storage size management: If you want to keep using just one table for everything, there are a few things you could tweak. First, make sure the table is using compression (ideally LZ4); next, you could lower the replication factor, which would also save space. I suppose if you had different keyspaces for old data and new data, you could use a different RF for each to save space.
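As a rough sketch (measurements_archive is a hypothetical keyspace for the cold data, and sstable_compression is the pre-3.0 option name):

    ALTER TABLE time_series
    WITH compression = { 'sstable_compression': 'LZ4Compressor' };

    ALTER KEYSPACE measurements_archive
    WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': 1 };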

For the amount of data you're pushing and needing to archive, I would encourage looking into time series databases (TSDBs) like Graphite and InfluxDB. For your goals and challenges, a TSDB will be much easier to use than massaging Cassandra into doing time series data.

1 vote

Partitioning data across a sequence of identical keyspaces or tables is not recommended, as both are designed to hold schema and metadata about your data; the actual partitioning of the data itself should be implemented via the partition/clustering keys.

Although backing up data using snapshots does not work on a monthly basis as intended, you could probably use incremental backups with a custom solution that stores the flushed sstables belonging to a single month together. For dropping data, using a TTL would still be the most common way of handling time series data and making sure you're not running out of disk space.
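For reference (time_series is just the table name used in the other answer, and the six-month value only mirrors the figure from the question), a table-wide TTL can be set as a table property:

    ALTER TABLE time_series
    WITH default_time_to_live = 15552000;   -- ~6 months (180 days) in seconds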