We are planning to store time series sensor data in Cassandra. Each sensor can have multiple data points per sample time point. I'd like to store all the data points per device together.
One thought I had was to create all the potential columns for the various data types we might collect:
CREATE TABLE ddata (
deviceID int,
day timestamp,
timepoint timestamp,
aparentPower int,
actualPower int,
actualEnergy int,
temperature float,
humidity float,
ppmCO2 int,
etc, etc, etc...
PRIMARY KEY ((deviceID,day),timepoint)
) WITH
clustering order by (timepoint DESC);
insert into ddata (deviceID,day,timepoint,temperature,humidity) values (1000001,'2013-09-02','2013-09-02 00:00:04',93,97.3);
deviceid | day | timepoint | actualenergy | actualpower | aparentpower | event | humidity | ppmco2 | temperature
----------+--------------------------+--------------------------+--------------+-------------+--------------+-------+----------+--------+-------------
1000001 | 2013-09-02 00:00:00-0700 | 2013-09-02 00:00:04-0700 | null | null | null | null | 97.3 | null | 93
1000001 | 2013-09-02 00:00:00-0700 | 2013-09-02 00:00:03-0700 | null | null | null | null | null | null | 92
1000001 | 2013-09-02 00:00:00-0700 | 2013-09-02 00:00:02-0700 | null | null | null | null | null | null | 91
1000001 | 2013-09-02 00:00:00-0700 | 2013-09-02 00:00:01-0700 | null | null | null | null | null | null | 90
The other thought was to create a map collection of the various data points a given device might be reporting:
CREATE TABLE ddata (
deviceID int,
day timestamp,
timepoint timestamp,
feeds map<text,int>,
PRIMARY KEY ((deviceID,day),timepoint)
) WITH
clustering order by (timepoint DESC);
insert into ddata (deviceID,day,timepoint,feeds) values (1000001,'2013-09-01','2013-09-01 00:00:04',{'temp':73,'humidity':99});
deviceid | day | timepoint | event | feeds
----------+--------------------------+--------------------------+------------+----------------------------------------------------------
1000001 | 2013-09-02 00:00:00-0700 | 2013-09-02 00:00:04-0700 | null | {'humidity': 97, 'temp': 93}
1000001 | 2013-09-02 00:00:00-0700 | 2013-09-02 00:00:03-0700 | null | {'temp': 92}
1000001 | 2013-09-02 00:00:00-0700 | 2013-09-02 00:00:02-0700 | null | {'temp': 91}
1000001 | 2013-09-02 00:00:00-0700 | 2013-09-02 00:00:01-0700 | null | {'temp': 90}
What are folks thoughts on the two options?
- From what I can see the first option will allow better typing of the different datatypes (int versus float), but makes the table kind of ugly.
- Will performance be better if I avoid using the collection types?
Is continuously adding additional columns as new sensor data types are added anything to worry about?
What other factors should I consider?
- What other data modeling ideas do folks have for this scenario?
Thanks, Chris