0
votes

We are planning to store time series sensor data in Cassandra. Each sensor can have multiple data points per sample time point. I'd like to store all the data points per device together.

One thought I had was to create all the potential columns for the various data types we might collect:

CREATE TABLE ddata (
  deviceID int,
  day timestamp,
  timepoint timestamp, 
  aparentPower int,
  actualPower int,
  actualEnergy int,
  temperature float,
  humidity float,
  ppmCO2 int,
  etc, etc, etc...
  PRIMARY KEY ((deviceID,day),timepoint)
) WITH
  clustering order by (timepoint DESC);

insert into ddata (deviceID,day,timepoint,temperature,humidity) values (1000001,'2013-09-02','2013-09-02 00:00:04',93,97.3);

 deviceid | day                      | timepoint                | actualenergy | actualpower | aparentpower | event | humidity | ppmco2 | temperature
----------+--------------------------+--------------------------+--------------+-------------+--------------+-------+----------+--------+-------------
  1000001 | 2013-09-02 00:00:00-0700 | 2013-09-02 00:00:04-0700 |         null |        null |         null |  null |     97.3 |   null |          93
  1000001 | 2013-09-02 00:00:00-0700 | 2013-09-02 00:00:03-0700 |         null |        null |         null |  null |     null |   null |          92
  1000001 | 2013-09-02 00:00:00-0700 | 2013-09-02 00:00:02-0700 |         null |        null |         null |  null |     null |   null |          91
  1000001 | 2013-09-02 00:00:00-0700 | 2013-09-02 00:00:01-0700 |         null |        null |         null |  null |     null |   null |          90

The other thought was to create a map collection of the various data points a given device might be reporting:

CREATE TABLE ddata (
  deviceID int,
  day timestamp,
  timepoint timestamp, 
  feeds map<text,int>,
  PRIMARY KEY ((deviceID,day),timepoint)
) WITH
  clustering order by (timepoint DESC);

insert into ddata (deviceID,day,timepoint,feeds) values (1000001,'2013-09-01','2013-09-01 00:00:04',{'temp':73,'humidity':99});

 deviceid | day                      | timepoint                | event      | feeds
----------+--------------------------+--------------------------+------------+----------------------------------------------------------
  1000001 | 2013-09-02 00:00:00-0700 | 2013-09-02 00:00:04-0700 |       null |                             {'humidity': 97, 'temp': 93}
  1000001 | 2013-09-02 00:00:00-0700 | 2013-09-02 00:00:03-0700 |       null |                                             {'temp': 92}
  1000001 | 2013-09-02 00:00:00-0700 | 2013-09-02 00:00:02-0700 |       null |                                             {'temp': 91}
  1000001 | 2013-09-02 00:00:00-0700 | 2013-09-02 00:00:01-0700 |       null |                                             {'temp': 90}

What are folks thoughts on the two options?

  • From what I can see the first option will allow better typing of the different datatypes (int versus float), but makes the table kind of ugly.
  • Will performance be better if I avoid using the collection types?
  • Is continuously adding additional columns as new sensor data types are added anything to worry about?

  • What other factors should I consider?

  • What other data modeling ideas do folks have for this scenario?

Thanks, Chris

2

2 Answers

1
votes

Essentially, since we do not know how many measurements that would arrive, we need a dynamic way to describe this in the Column family.

As you pointed in the second example, CQL provides map data type to hold dynamic collections.

The second one is preferred. But also depends upon queries that you may be issuing. To get 'temp' from 'feeds', the application has to parse the map output.

1
votes

The immediate pros and cons I can see:

    • using a map column would allow you to have "unlimited" metrics. (nb I think there's a limitation on how much data you can store in the map though)
    • you won't be able to read a single value out of the map; if you have columns for each metric you can read a single value at a time; you'll still be able to update a single value inside the map
  1. as you mention in your question, you have limited type support in a map

These are the most obvious differences I can see.