Part of the problem you'll run into is that each sensor will have 200k readings per day. In general, you want to keep each partition under 100k rows, so your second idea (having the date as part of the PK) may have performance issues.
Really, what you are looking to do is what we refer to as 'bucketing': grouping things together so that queries are usable and performant.
To really help with this, we will need to understand a little more information:
- How many devices do you have? Will that number grow or is it finite?
- In plain English, what are examples of the queries that you are trying to answer?
Incorporating your answers into the solution (below):
Alright, here is a potential idea...
We DO care about bucketing, though, to try to stay around the optimal 100k rows per partition.
You're going to want two tables:
- Lookup table
- Sensor table
Lookup table will look something like:
CREATE TABLE lookup_table (
    deviceid TEXT,
    sensor_map MAP<TEXT, TEXT>,
    PRIMARY KEY (deviceid)
);
- deviceid is the unique ID for each device
- sensor_map is a map of the sensors that a given device has, with a corresponding unique ID for each specific sensor (e.g. {'temperature': '183439', 'humidity': '84543292', 'other-sensor': ...})
- That way each device has a mapping of the sensors available to it
- An example query would be (a short population/read sketch also follows this list):
SELECT * FROM lookup_table WHERE deviceid = '1234';
- Another approach would be to have an individual column for each type of sensor, with the unique ID for that sensor as the value
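Here is a minimal sketch of what populating and reading that lookup table could look like from application code, using the DataStax Python driver; the contact point, keyspace name, and sensor IDs are assumptions/placeholders, not part of your schema:

from cassandra.cluster import Cluster

# Sketch only: assumes a local node and a keyspace named 'sensors'.
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('sensors')

# Register a device and the sensors it carries (IDs are placeholders).
session.execute(
    "INSERT INTO lookup_table (deviceid, sensor_map) VALUES (%s, %s)",
    ('1234', {'temperature': '183439', 'humidity': '84543292'})
)

# Later: fetch the sensor map for a device before querying sensor_data.
row = session.execute(
    "SELECT sensor_map FROM lookup_table WHERE deviceid = %s", ('1234',)
).one()
print(row.sensor_map)   # e.g. {'temperature': '183439', 'humidity': '84543292'}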
Sensor table will look like:
CREATE TABLE sensor_data (
    sensorid TEXT,
    sensor_value (whatever data type fits what you need),
    ts TIMESTAMP,
    reading_date DATE,
    time_bucket INT,
    PRIMARY KEY ((reading_date, sensorid, time_bucket), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
- As each sensor will get 200k readings/day AND we want to keep each partition under 100k rows, that means we want to do two partitions for each sensor each day
- How could you bucket? You should do it in two parts: bucket daily, so each sensor gets a new partition each day (reading_date), and split each day into two (due to the number of readings you're expecting), AM or PM: AM equals bucket 1, PM equals bucket 2. Or use 24-hour time, where 0000-1159 equals 1 and 1200-2359 equals 2 (see the sketch after this list)
- Within your application, provide the specific sensorid; time_bucket will come from the time that you're actually requesting in the query (e.g. if the time is 1135 hours, then time_bucket = 1), and reading_date will come from the actual day that you are querying
- Since you are clustering with ts DESC, it will retrieve the latest reading for that given sensorid. So it would look like:
SELECT * FROM sensor_data WHERE reading_date = '2017-12-31' AND sensorid = '1234' AND time_bucket = 1 LIMIT 1;
- By maintaining ts as a clustering column, you'll be able to keep all of the readings for a given sensor; none will be overwritten
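To make the bucketing concrete, here is a rough sketch (not the only way to do it) of how application code could derive reading_date and time_bucket from a timestamp and then pull the latest reading for one sensor; it assumes the AM/PM split above, and the keyspace name and sensorid are placeholders:

from datetime import datetime, timezone
from cassandra.cluster import Cluster

def bucket_for(ts: datetime) -> tuple:
    """Return (reading_date, time_bucket) for a reading timestamp."""
    reading_date = ts.date()                 # one partition per sensor per day
    time_bucket = 1 if ts.hour < 12 else 2   # AM -> 1, PM -> 2
    return reading_date, time_bucket

session = Cluster(['127.0.0.1']).connect('sensors')   # assumed keyspace

now = datetime.now(timezone.utc)
reading_date, time_bucket = bucket_for(now)

# Latest reading for one sensor in the current bucket (ts DESC + LIMIT 1).
latest = session.execute(
    "SELECT sensor_value, ts FROM sensor_data "
    "WHERE reading_date = %s AND sensorid = %s AND time_bucket = %s LIMIT 1",
    (reading_date, '183439', time_bucket)
).one()

Note that if the current bucket has no readings yet (e.g. just after midnight), the application would need to fall back to the previous bucket or day.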
Important to know: this works great if there is an even distribution of sensor readings throughout the 24-hour day. However, if you're reading heavily in the morning and not at all in the afternoon, then the distribution isn't even and we'll have to figure out another way to bucket. But I think you get what is going on.
To query:
- There will be one query to retrieve all of the sensorids that a device has; once you have those sensorids, you can then use them for the next step
- There will be n queries: one for the sensor_value of each sensorid (see the sketch after this list)
- Since we are bucketing (via time_bucket), you should have an even distribution throughout all of the partitions
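Putting those two steps together, a sketch of the fan-out could look like the following; it issues the per-sensor reads asynchronously with the driver so the n queries run in parallel (device ID, keyspace, and the AM/PM bucketing are the same assumptions as above):

from datetime import datetime, timezone
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('sensors')   # assumed keyspace

now = datetime.now(timezone.utc)
reading_date = now.date()
time_bucket = 1 if now.hour < 12 else 2

# Step 1: one query to get all of the sensorids for the device.
device = session.execute(
    "SELECT sensor_map FROM lookup_table WHERE deviceid = %s", ('1234',)
).one()

# Step 2: n parallel queries, one per sensorid.
latest_stmt = session.prepare(
    "SELECT sensor_value, ts FROM sensor_data "
    "WHERE reading_date = ? AND sensorid = ? AND time_bucket = ? LIMIT 1"
)
futures = {
    name: session.execute_async(latest_stmt, (reading_date, sensorid, time_bucket))
    for name, sensorid in device.sensor_map.items()
}
for name, future in futures.items():
    row = future.result().one()
    print(name, row.sensor_value if row else None)

Preparing the statement once and fanning out with execute_async keeps the n per-sensor reads cheap and parallel.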
Lastly: give me the latest sensorid by a given value.
To do that there are a couple of different ways...
- Run a Spark job: to do that, you'll have to lift and shift the data to run the Spark query
- Use DataStax Enterprise: with DSE you have an integrated Analytics component based on Spark so you can run Spark jobs without having to manage a separate Spark cluster. Disclosure: I work there, btw
- Create an additional Cassandra (C*) table and do some parallel writes
For the additional C* table:
CREATE TABLE sensor_by_value (
    sensor_value INT,
    ts TIMESTAMP,
    sensorid TEXT,
    reading_date DATE,
    time_bucket INT,
    PRIMARY KEY ((sensor_value, reading_date), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
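For the parallel writes, a sketch of the write path could look like this: every reading is written to both sensor_data and sensor_by_value at insert time. It assumes sensor_value ended up as an INT and reuses the bucketing above; the value-bucketing refinements discussed below would change the sensor_by_value key, so treat this as illustrative only:

from datetime import datetime, timezone
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('sensors')   # assumed keyspace

insert_data = session.prepare(
    "INSERT INTO sensor_data (reading_date, sensorid, time_bucket, ts, sensor_value) "
    "VALUES (?, ?, ?, ?, ?)"
)
insert_by_value = session.prepare(
    "INSERT INTO sensor_by_value (sensor_value, reading_date, ts, sensorid, time_bucket) "
    "VALUES (?, ?, ?, ?, ?)"
)

def record_reading(sensorid: str, value: int, ts: datetime) -> None:
    reading_date = ts.date()
    time_bucket = 1 if ts.hour < 12 else 2
    # Fire both writes asynchronously; Cassandra has no joins, so the
    # denormalized table is maintained by the application.
    f1 = session.execute_async(insert_data, (reading_date, sensorid, time_bucket, ts, value))
    f2 = session.execute_async(insert_by_value, (value, reading_date, ts, sensorid, time_bucket))
    f1.result(); f2.result()

record_reading('183439', 72, datetime.now(timezone.utc))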
You will definitely have to do some time bucketing here:
- Remember, we don't want any more than 100k rows per partition
- You'll have to understand the possible values (range)
- The frequency of each reading
- If you have 100 devices, 100 sensors, and each sensor being read up to 200k per day, then you have a potential for up to 2B sensor readings per day...
- Typically, what I have my customers do is run some analysis on their data to understand these bits of info, that way you can be sure to account for it
- How much you have to bucket will depend on the frequency (a rough sizing sketch follows this list)
- Good luck! :-)
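As a rough way to reason about that sizing, a back-of-the-envelope sketch like the one below shows how reading volume translates into the number of buckets you'd need per partition key; every input here is an assumption you'd replace with numbers from your own analysis:

import math

TARGET_ROWS_PER_PARTITION = 100_000

# sensor_data: one sensor's readings per day.
readings_per_sensor_per_day = 200_000
buckets_per_sensor_day = math.ceil(readings_per_sensor_per_day / TARGET_ROWS_PER_PARTITION)
print(buckets_per_sensor_day)          # 2 -> the AM/PM split above

# sensor_by_value: how many readings land on the same value in a day?
devices, sensors_per_device = 100, 100
total_readings_per_day = devices * sensors_per_device * readings_per_sensor_per_day
print(f"{total_readings_per_day:,}")   # 2,000,000,000 potential readings/day

distinct_values = 1_000                # assumed value range of the sensor
rows_per_value_per_day = total_readings_per_day / distinct_values
buckets_per_value_day = math.ceil(rows_per_value_per_day / TARGET_ROWS_PER_PARTITION)
print(buckets_per_value_day)           # how finely you'd have to bucket that table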
Final tip
Look into compaction strategies: specifically time window compaction strategy (TWCS) and adding a default_time_to_live
Your data seems immutable after the initial insert
TWCS will make the operational overhead of compaction much lower as you fine-tune it for the time window that you need
Adding a default_time_to_live will also help with the operational overhead of deleting data after you don't need it anymore.
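As one possible shape for that, a sketch of applying TWCS and a default TTL to the readings table might look like the following; the one-day window and 30-day TTL are placeholders, not recommendations, so tune them to your ingest and retention needs:

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('sensors')   # assumed keyspace

# Sketch only: switch sensor_data to TWCS and expire old readings.
session.execute("""
    ALTER TABLE sensor_data
    WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                       'compaction_window_unit': 'DAYS',
                       'compaction_window_size': 1}
    AND default_time_to_live = 2592000  -- 30 days, in seconds
""")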
Does this answer and/or satisfy the queries that you're trying to answer? If not, let me know and we can iterate.
To learn all of this stuff, go to DataStax Academy for a ton of free training. Data Modeling (DS 220) is a great course!