
I need to store sensor readings in Cassandra (version 2!). There are n sensors, each of which can send up to m different values with different types (e.g. float, bool, string). The values have to be stored in Cassandra. Later, values will be queried mostly by time range, so a query could be "give me all readings from 2016-05-01 09:00 to 2016-05-15 13:00". There could be filters by sensor ID/type, but the main criterion will always be time. (So a query could be "give me all data for sensors 1 and 5 from 2016-05-05", but most likely not "give me all data for sensors 1 and 5".)

For more detailed queries, it is OK if all the data (restricted by time and possibly sensor ID) has to be scanned. So for the query "give me all sensor data for sensor 5 from 2016-05-05 where the float value of a reading is greater than 1000" it's OK if Cassandra has to scan all the values of sensor 5 from 2016-05-05!

I read a lot of blog posts/questions about data modelling (e.g. [1] [2] [3] [4] [5] [6]), but some of them are years old and I am not sure if that is still the right way to do it.

My main questions are:

  • What data type do I use for the timestamp (needs millisecond resolution)
  • How do I define the keys? (e.g. do I need an hourly primary key like some examples use? If yes, can I combine results for more than one hour in cassandra or do I need to do that manually?)
  • How do I add the sensor ID so it can also be efficiently queried?

The sensor data will always be inserted ordered, so no previous data is changed and no data with a timestamp lower than the current maximum will ever be added.


1 Answer


What data type do I use for the timestamp (needs millisecond resolution)

timeuuid, definitely. It embeds a version-1 UUID timestamp with 100-nanosecond precision (well beyond millisecond resolution) and guarantees uniqueness, so two readings arriving in the same millisecond won't clobber each other as they would with a plain timestamp clustering column.
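A quick sketch of how that looks in CQL on Cassandra 2 (assuming an int sensor_id and a bigint yyyyMMddHH time bucket, which are illustrative choices, not fixed requirements): now() generates a timeuuid server-side at insert time, and the built-in dateOf()/unixTimestampOf() functions recover the embedded time when reading.

```sql
-- generate the timeuuid at insert time
INSERT INTO sensor_data (sensor_id, partitioning_time_range, insertion_time_in_timeuuid, float_value)
VALUES (5, 2016050516, now(), 1000.5);

-- read the embedded time back with millisecond resolution
SELECT dateOf(insertion_time_in_timeuuid),           -- as timestamp
       unixTimestampOf(insertion_time_in_timeuuid),  -- as bigint (ms since epoch)
       float_value
FROM sensor_data
WHERE sensor_id = 5 AND partitioning_time_range = 2016050516;
```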

How do I define the keys? (e.g. do I need an hourly primary key like some examples use? If yes, can I combine results for more than one hour in cassandra or do I need to do that manually?)

The biggest secret of data modeling with Cassandra is to limit the size of each physical partition to something manageable (~100 MB / ~10 million cells).

In your case, the sub-partitioning granularity for each sensor depends on its insertion rate.

If some sensors insert data at a crazy rate of thousands of data points/sec, a partition per hour is the appropriate granularity. Of course 1 partition per hour (PRIMARY KEY ((sensor_id, hour), insertion_time_in_timeuuid)) will limit your query capabilities, e.g. if you want data for some sensor between 4pm and 10pm you'll need to issue 7 queries or use the IN clause (SELECT * FROM ... WHERE sensor_id=xxx AND hour IN (16, 17, 18, 19, 20, 21, 22)). (In practice the hour bucket also needs to encode the date, e.g. 2016050516 for 2016-05-05 4pm, otherwise readings from different days would land in the same partition.)
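Within a single bucket you can also slice by arbitrary time ranges on the clustering column using the built-in minTimeuuid()/maxTimeuuid() functions (a sketch, again assuming a bigint yyyyMMddHH bucket):

```sql
-- readings for sensor 5 between 16:00 and 16:30 on 2016-05-05
SELECT * FROM sensor_data
WHERE sensor_id = 5
  AND partitioning_time_range = 2016050516
  AND insertion_time_in_timeuuid >= minTimeuuid('2016-05-05 16:00+0000')
  AND insertion_time_in_timeuuid <= maxTimeuuid('2016-05-05 16:30+0000');
```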

If the insertion rate is moderate, you may sub-partition by day/week/month. There is no rule of thumb here since it all depends on the data volume.
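A day-bucketed variant could look like this (a sketch; the yyyyMMdd encoding and the column names are assumptions for illustration, not a fixed recipe):

```sql
CREATE TABLE sensor_data_by_day (
  sensor_id int,
  day int,                               -- e.g. 20160505 (yyyyMMdd), computed client-side
  insertion_time_in_timeuuid timeuuid,
  float_value float,
  PRIMARY KEY ((sensor_id, day), insertion_time_in_timeuuid)
) WITH CLUSTERING ORDER BY (insertion_time_in_timeuuid DESC);
```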

The key thing to remember is to keep a balance between ease of querying and partition size.

How do I add the sensor ID so it can also be efficiently queried?

Put it as a component of the partition key alongside the time-range bucket, e.g. PRIMARY KEY ((sensor_id, hour), insertion_time_in_timeuuid).

The sensor data will always be inserted ordered, so no previous data is changed and no data with a timestamp lower than the current maximum will ever be added.

CREATE TABLE sensor_data (
  sensor_id int,
  partitioning_time_range bigint,
  insertion_time_in_timeuuid timeuuid,
  float_value float,
  int_value int,
  bool_value boolean,
  text_value text,
  PRIMARY KEY ((sensor_id, partitioning_time_range), insertion_time_in_timeuuid)
) WITH CLUSTERING ORDER BY (insertion_time_in_timeuuid DESC);

To accommodate the different data types, just create one column per type (float_value, bool_value, ...). At runtime, if you only use 1 column out of 4/5, Cassandra will insert just 1 physical cell on disk (unlike relational databases, which reserve space for unused columns).
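To make the sparse layout concrete, here is a sketch of two inserts into the table above — each row stores only the value cell it actually names:

```sql
-- a float reading and a bool reading; each INSERT writes exactly one value cell
INSERT INTO sensor_data (sensor_id, partitioning_time_range, insertion_time_in_timeuuid, float_value)
VALUES (5, 2016050516, now(), 1234.5);

INSERT INTO sensor_data (sensor_id, partitioning_time_range, insertion_time_in_timeuuid, bool_value)
VALUES (5, 2016050516, now(), true);
```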