
I caveat this question by stating: I am somewhat new to NoSQL and very new to Cassandra, but it seems like it might be a good fit for what I'm trying to do.

Say I have a list of sensors giving input at reasonable intervals. My proposed data model is to partition by the sensor's name, its location (area), and the date (written as yyyyMMdd), and to cluster the readings for that day by the actual time the reading occurred. The thinking is that the query "get all readings from sensor A on date B" should be extremely quick. So far so good, I think. The table / CF looks like this in CQL:

CREATE TABLE data (
    area_id int,
    sensor varchar,
    date ascii,
    event_time timeuuid,
    PRIMARY KEY ((area_id, sensor, date), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
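(To illustrate the partitioning scheme above: the yyyyMMdd component of the partition key would be derived on the client before each read or write. A minimal sketch in Python, with the function name `partition_date` being my own invention, not anything from the question:)

```python
from datetime import datetime, timezone

def partition_date(ts: datetime) -> str:
    """Format a reading's timestamp as the yyyyMMdd string used in the partition key."""
    return ts.strftime("%Y%m%d")

# A reading taken on 6 Nov 2013 lands in the '20131106' partition for its sensor/area.
print(partition_date(datetime(2013, 11, 6, 17, 37, 2, tzinfo=timezone.utc)))  # 20131106
```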

This doesn't, however, actually include any data, and I'm not sure how to add it to the model. Each reading (from the same sensor) can have a different set of arbitrary data, and I won't know ahead of time what this will be. E.g. I could get temperature data, I could get humidity, I could get both, or I could get something I haven't seen before. It's up to the person who actually recorded the data what they want to submit (it's not read from automated sensors).

Given that I want to do query operations on this data (which is basically UGC), what are my options? Queries will normally consist of counts over the data (e.g. count readings from sensor A on date B where some_ugc_valueX = C and some_ugc_valueY = D). It is worth noting that there will be more data points than would normally be queried at once. A reading could have 20 data values, but maybe only 2 or 3 would be queried; it's just unknown which ones ahead of time.

Currently I have thought of:

  1. Store the data for each sensor reading as a Map type. This would certainly keep the model simple, but my understanding is that querying would then be difficult? I think I would need to pull the entire map back for each sensor reading, then check the values and do the counting outside of Cassandra in Storm/Hadoop/whatever.
  2. Store each of the user values as another column (a composite column with the event_time uuid). This would mean not using CQL, as it doesn't support adding arbitrary new columns at insert time; the Thrift API does, however. This would let Cassandra do the counting itself.
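(For option 1, the client-side counting step would look roughly like this. A hedged sketch in Python, assuming the rows have already been fetched and each reading's Map column has been materialised as a plain dict; `count_matching` and the sample values are illustrative names, not from the question:)

```python
def count_matching(readings, criteria):
    """Count readings whose data map contains every key/value pair in criteria.

    This is the filtering Cassandra cannot do server-side over arbitrary map
    keys, so it runs in the client (or a Storm/Hadoop job) instead.
    """
    return sum(
        1
        for data in readings
        if all(data.get(k) == v for k, v in criteria.items())
    )

rows = [
    {"temperature": "21", "humidity": "40"},
    {"temperature": "21"},
    {"pressure": "1013"},
]
# Count readings where temperature = 21 and humidity = 40.
print(count_matching(rows, {"temperature": "21", "humidity": "40"}))  # 1
```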

Maybe I'm going about this the wrong way? Maybe Cassandra isn't even the best choice for this kind of data?

1 Answer


tl;dr: you can't choose both speed and absolute flexibility ;-)

Queries based on User Generated Content (UGC) are going to be complex - you aren't going to be able to produce a one-size-fits-all table definition that allows quick responses to queries over UGC. Even if you choose to use Maps, Cassandra has to deserialize the entire data structure on every query, so it isn't really an option for big Maps - which, as you suggest in your question, is likely to be the case here.

An alternative might be to store the sensor data in a serialised form, e.g. JSON. This gives maximum flexibility in what is stored - at the expense of being unable to make complex queries server-side. The serialisation/deserialisation burden is pushed to the client, and all data is sent over the wire. Here's a simple example:

Table creation (slightly simpler than your example - I've dropped date):

create table data(
  area_id int, 
  sensor varchar, 
  event_time timeuuid, 
  data varchar, 
  primary key(area_id,sensor,event_time)
);

Insertion:

insert into data(area_id,sensor,event_time,data) VALUES (1,'sensor1',now(),'{"datapoint1":"value1"}');
insert into data(area_id,sensor,event_time,data) VALUES (1,'sensor2',now(),'{"datapoint1":"value1","count":"7"}');

Querying by area_id and sensor:

> select area_id,sensor,dateof(event_time),data from data where area_id=1 and sensor='sensor1';

 area_id | sensor  | dateof(event_time)       | data
---------+---------+--------------------------+-------------------------
       1 | sensor1 | 2013-11-06 17:37:02+0000 | {"datapoint1":"value1"}

(1 rows)

Querying by area_id:

> select area_id,sensor,dateof(event_time),data from data where area_id=1;

 area_id | sensor  | dateof(event_time)       | data
---------+---------+--------------------------+-------------------------------------
       1 | sensor1 | 2013-11-06 17:37:02+0000 |             {"datapoint1":"value1"}
       1 | sensor2 | 2013-11-06 17:40:49+0000 | {"datapoint1":"value1","count":"7"}

(2 rows)

(Tested using [cqlsh 4.0.1 | Cassandra 2.0.1 | CQL spec 3.1.1 | Thrift protocol 19.37.0].)
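Since the serialisation and the filtering both happen client-side with this approach, the counting query from the question would be done after fetching the rows. A minimal sketch in Python using the standard `json` module, assuming the `data` column values have already been fetched as strings (`encode_reading` and `count_where` are illustrative names of my own, not part of any driver API):

```python
import json

def encode_reading(values: dict) -> str:
    """Serialise a reading's arbitrary key/value data for the varchar column."""
    return json.dumps(values)

def count_where(data_columns, key, value):
    """Client-side equivalent of 'count readings where <key> = <value>'."""
    return sum(
        1 for data_json in data_columns
        if json.loads(data_json).get(key) == value
    )

# The same two readings as the inserts above, serialised before writing.
fetched = [
    encode_reading({"datapoint1": "value1"}),
    encode_reading({"datapoint1": "value1", "count": "7"}),
]
print(count_where(fetched, "count", "7"))       # 1
print(count_where(fetched, "datapoint1", "value1"))  # 2
```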