
I am fairly new to Cassandra and I am trying to understand how to design my tables for IoT sensors.

The idea is to have several devices, each with several sensors attached, sending data periodically (up to around 200,000 values per sensor per device per day).

I'd like to be able to query for the latest value of a sensor, for a specific list of sensors and devices, in more or less real time. Also, devices do not always send data and may be down for long periods of time.

After a lot of reading I came up with something like this:

CREATE TABLE "sensor_data" (
    deviceid TEXT,
    sensorid TEXT,
    ts timestamp,
    value TEXT,
    PRIMARY KEY ((deviceid, sensorid), ts)
) WITH CLUSTERING ORDER BY (ts DESC);

The idea behind this would be to perform one query per device and sensor, such as:

SELECT deviceid, sensorid, ts, value FROM sensor_data WHERE deviceid = 'device1' AND sensorid = 'temperature' LIMIT 1;

And run this for each device and sensor. It's not one query to return it all (which would be ideal), but it seems fast enough to run for potentially up to 100 sensors or so across a few devices, with the option of parallelizing the queries.
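For the parallel part, something like this rough sketch is what I had in mind, using the DataStax Python driver (cassandra-driver); the keyspace name and the device/sensor pairs below are made up:

from cassandra.cluster import Cluster

# Rough sketch: fan out one LIMIT 1 query per (device, sensor) pair.
# The keyspace name and the pairs are made up for illustration.
session = Cluster(['127.0.0.1']).connect('iot')

latest = session.prepare(
    "SELECT deviceid, sensorid, ts, value FROM sensor_data "
    "WHERE deviceid = ? AND sensorid = ? LIMIT 1"
)

pairs = [('device1', 'temperature'), ('device1', 'humidity'),
         ('device2', 'temperature')]

# execute_async() sends all the queries without blocking; gather afterwards.
futures = [session.execute_async(latest, pair) for pair in pairs]
rows = [f.result().one() for f in futures]

for row in filter(None, rows):
    print(row.deviceid, row.sensorid, row.ts, row.value)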

However, from what I have read so far, I understand this would give me very wide partitions over time, which might be complicated in terms of long-term storage and Cassandra's limitations.

I am thinking that maybe adding something like the date to the partition key, like so (as seen on some blogs and guides), might be a good idea:

CREATE TABLE "sensor_data" (
    deviceid TEXT,
    sensorid TEXT,
    date TEXT,
    ts timestamp,
    value TEXT,
    PRIMARY KEY ((deviceid, sensorid, date), ts)
) WITH CLUSTERING ORDER BY (ts DESC);

And then query like:

SELECT deviceid, sensorid, date, ts, value FROM sensor_data WHERE deviceid = 'device1' AND sensorid = 'temperature' AND date = '2018-11-14' LIMIT 1;

Does that even make sense? It feels like it might mitigate the storage issues and allow for easier archiving of old data in the future. However, how do I go about querying for the latest value of a specific sensor and device if that device was down for a day or more? Do I really have to query for one day and, if nothing is found, query the previous day, and so forth (maybe limiting it to the last few days)?
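To illustrate what I mean, here is a rough sketch of that day-by-day fallback (again with the Python driver; the keyspace and the three-day cutoff are made up):

from datetime import date, timedelta
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('iot')  # made-up keyspace

latest_for_day = session.prepare(
    "SELECT ts, value FROM sensor_data "
    "WHERE deviceid = ? AND sensorid = ? AND date = ? LIMIT 1"
)

def latest_value(deviceid, sensorid, max_days_back=3):
    # Walk backwards one day at a time until a reading turns up.
    day = date.today()
    for _ in range(max_days_back):
        row = session.execute(latest_for_day,
                              (deviceid, sensorid, day.isoformat())).one()
        if row is not None:
            return row
        day -= timedelta(days=1)
    return None  # device has been silent for the whole window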

Are there better ways to handle this in Cassandra, or am I headed in the right direction?


1 Answer


Part of the problem you'll run into is that each sensor will produce 200k readings per day. In general, you want to keep each partition under 100k rows. So your second idea (having the date as part of the partition key) may still have performance issues, since a single day's partition would hold around 200k rows.

Really, what you are looking for is what we refer to as 'bucketing': how to group things together so that queries are usable and performant.

To really help with this, we will need to understand a little more information:

  • How many devices do you have? Will that number grow or is it finite?
  • In plain English, what are examples of the queries that you are trying to answer?

Updating the answer based on your replies (below):

Alright, here is a potential idea...

We DO care about bucketing, though, to stay near the optimal ~100k rows per partition.

You're going to want two tables:

  1. Lookup table
  2. Sensor table

Lookup table will look something like:

CREATE TABLE lookup_table (
    deviceid TEXT,
    sensor_map MAP<TEXT, TEXT>,
    PRIMARY KEY (deviceid)
);
  • deviceid is the unique ID for each device
  • sensor_map is a CQL map of the sensors a given device has, each with a corresponding unique ID for that specific sensor (e.g. {'temperature': '183439', 'humidity': '84543292', 'other-sensor': 'blah'})
  • That way each device has a mapping of the sensors available to it
  • An example query would be: SELECT * FROM lookup_table WHERE deviceid = '1234'; (a fuller sketch follows this list)
  • Another approach would be to have an individual column for each type of sensor, with the unique ID for each sensor as the value
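For illustration, a quick sketch of populating and reading that table with the Python driver (the keyspace and all IDs are made up):

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('iot')  # made-up keyspace

# Write the per-device sensor map (IDs are made up).
session.execute(
    "INSERT INTO lookup_table (deviceid, sensor_map) VALUES (%s, %s)",
    ('device1', {'temperature': '183439', 'humidity': '84543292'})
)

# One read returns every sensorid the device owns.
row = session.execute(
    "SELECT sensor_map FROM lookup_table WHERE deviceid = %s",
    ('device1',)
).one()
sensor_ids = list(row.sensor_map.values())  # the unique sensor IDs for this device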

Sensor table will look like:

CREATE TABLE sensor_data (
    sensorid TEXT,
    sensor_value INT,    -- or whatever data type fits what you need
    ts TIMESTAMP,
    reading_date DATE,
    time_bucket INT,
    PRIMARY KEY ((reading_date, sensorid, time_bucket), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
  1. As each sensor will get 200k readings/day and we want to keep each partition under 100k rows, we want two partitions per sensor per day
  2. How could you bucket? In two parts: bucket daily, so each sensor gets a new partition each day (reading_date), and split each day in two (due to the number of readings you're expecting): AM = bucket 1, PM = bucket 2 (or, in 24-hour terms, 00:00-11:59 is bucket 1 and 12:00-23:59 is bucket 2)
  3. Within your application, provide the specific sensorid; time_bucket comes from the time you're actually querying for (e.g. if the time is 11:35, then time_bucket = 1), and reading_date comes from the actual day you are querying (see the sketch after this list)
  4. Since you are clustering by ts DESC, it will retrieve the latest reading for that given sensorid. So it would look like: SELECT * FROM sensor_data WHERE reading_date = '2017-12-31' AND sensorid = '1234' AND time_bucket = 1 LIMIT 1;
  5. By keeping ts as a clustering column, you retain all of the readings for a given sensor; none will be overwritten
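To make items 3 and 4 concrete, a rough sketch with the Python driver (the keyspace is made up; falling back from the PM to the AM bucket covers the start of the afternoon before any bucket-2 readings exist):

from datetime import datetime
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('iot')  # made-up keyspace

latest = session.prepare(
    "SELECT sensor_value, ts FROM sensor_data "
    "WHERE reading_date = ? AND sensorid = ? AND time_bucket = ? LIMIT 1"
)

def latest_reading(sensorid, when=None):
    when = when or datetime.utcnow()
    bucket = 1 if when.hour < 12 else 2   # AM = bucket 1, PM = bucket 2
    # If we're in the PM but bucket 2 is still empty, check the AM bucket.
    for b in range(bucket, 0, -1):
        row = session.execute(latest, (when.date(), sensorid, b)).one()
        if row is not None:
            return row
    return None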

Important to know: this works well if there is an even distribution of sensor readings throughout the 24-hour day. However, if you read heavily in the morning and not at all in the afternoon, the distribution isn't even and we'd have to figure out another way to bucket. But I think you get what is going on.

To query:

  • There will be one query to retrieve all of the sensorids that a device has; once you have those sensorids, you can use them for the next step
  • There will be n queries, one per sensorid, to fetch each latest sensor_value
  • Since we are bucketing (via time_bucket), the readings should be evenly distributed across the partitions

Lastly, for "give me the latest sensorid with a given value", there are a couple of different ways to do it...

  • Run a Spark job: to do that, you'll have to lift and shift the data to run the Spark query
  • Use DataStax Enterprise: with DSE you have an integrated Analytics component based on Spark so you can run Spark jobs without having to manage a separate Spark cluster. Disclosure: I work there, btw
  • Create an additional Cassandra (C*) table and do some parallel writes

For the additional C* table:

CREATE TABLE sensor_by_value (
    sensor_value INT,
    ts TIMESTAMP,
    sensorid TEXT,
    reading_date DATE,
    time_bucket INT,
    PRIMARY KEY ((sensor_value, reading_date), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
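A rough sketch of those parallel writes with the Python driver (the keyspace is made up; both inserts carry the same reading):

from datetime import datetime
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('iot')  # made-up keyspace

insert_data = session.prepare(
    "INSERT INTO sensor_data (reading_date, sensorid, time_bucket, ts, sensor_value) "
    "VALUES (?, ?, ?, ?, ?)"
)
insert_by_value = session.prepare(
    "INSERT INTO sensor_by_value (sensor_value, reading_date, ts, sensorid, time_bucket) "
    "VALUES (?, ?, ?, ?, ?)"
)

def record_reading(sensorid, value, ts=None):
    ts = ts or datetime.utcnow()
    bucket = 1 if ts.hour < 12 else 2
    # Fire both writes at once ("parallel writes"), then wait for both acks.
    futures = [
        session.execute_async(insert_data, (ts.date(), sensorid, bucket, ts, value)),
        session.execute_async(insert_by_value, (value, ts.date(), ts, sensorid, bucket)),
    ]
    for f in futures:
        f.result()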

You will definitely have to do some time bucketing here:

  • Remember, we don't want more than ~100k rows per partition
  • You'll have to understand the possible range of values
  • And the frequency of each reading
  • If you have 100 devices, 100 sensors each, and each sensor is read up to 200k times per day, then you have the potential for up to 2B sensor readings per day...
  • Typically, I have my customers run some analysis on their data to understand these bits of info; that way you can be sure to account for them
  • How much you have to bucket will depend on that frequency

Good luck! :-)

Final tip

Look into compaction strategies: specifically the time window compaction strategy (TWCS) and adding a default_time_to_live (a sketch follows these points)

  • Your data seems immutable after the initial insert

  • TWCS will make the operational overhead of compaction much lower as you fine-tune it for the time window that you need

  • A default_time_to_live will also help with the operational overhead of deleting data once you no longer need it.
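For example, something along these lines (a sketch, not a recommendation; the 1-day window and 90-day TTL are arbitrary placeholders to tune for your retention needs):

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('iot')  # made-up keyspace

# 1-day compaction windows and a 90-day TTL are placeholders; tune both.
session.execute("""
    ALTER TABLE sensor_data
    WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': 1
    }
    AND default_time_to_live = 7776000  -- 90 days, in seconds
""")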

Does this answer and/or satisfy the queries that you're trying to answer? If not, let me know and we can iterate.

To learn all of this stuff, go to DataStax Academy for a ton of free training. Data Modeling (DS 220) is a great course!