1
votes

I need to capture time-series sensor data in Cassandra. The best practices for handling time-series data in DynamoDB is as follow:

  1. Create one table per time period, provisioned with write capacity less than 1,000 write capacity units (WCUs).
  2. Before the end of each time period, prebuild the table for the next period.
  3. As soon as a table is no longer being written to, reduce its provisioned write capacity. Also reduce the provisioned read capacity of earlier tables as they age, and archive or delete the ones whose contents will rarely or never be needed.

Now I am wondering how I can implement the same concept in Cassandra! Is there any way to manually configure write/read capacity in Cassandra as well?

2

2 Answers

1
votes

For cassandra just make a single table but include some time period in the partition key (so the partitions do not grow indefinitely and get too large). No table maintenance and read/write capacity is really more dependent on workload and schema, size of cluster etc but shouldn't really need to be worried about except for sizing the cluster.

1
votes

This really depends on your own requirements that you need to discuss with development, etc.

There are several ways to handle time-series data in Cassandra:

  1. Have one table for everything. As Chris mentioned, just include the time component into partition key, like a day, and store data per sensor/day. If the data won't be updated, and you know in advance how long they will be kept, so you can set TTL to data, then you can use TimeWindowCompactionStrategy. Advantage of this approach is that you have only one table and don't need to maintain multiple tables - that's make easier for development and maintenance.
  2. The same approach as you described - create a separate table for period of time, like a month, and write data into them. In this case you can effectively drop the whole table when data "expires". Using this approach you can update data if necessary, and don't require to set TTL on data. But this requires more work for development and ops teams as you need to reach multiple tables. Also, take into account that there are some limits on the number of tables in the cluster - it's recommended not to have more than 200 tables as every table requires a memory to keep metadata, etc. Although, some things, like, a bloom filter, could be tuned to occupy less memory for tables that are rarely read.