I'll preface this question by saying: I'm somewhat new to NoSQL and very new to Cassandra, but it seems like it might be a good fit for what I'm trying to do.
Say I have a list of sensors giving input at reasonable intervals. My proposed data model is to partition by the sensor's name, its location (area), and the date (written as yyyyMMdd), and then cluster the readings for that day by the actual time each reading occurred. The thinking is that the query "get all readings from sensor A on date B" should be extremely quick. So far so good, I think. The table / CF looks like this in CQL:
    CREATE TABLE data (
        area_id int,
        sensor varchar,
        date ascii,
        event_time timeuuid,
        PRIMARY KEY ((area_id, sensor, date), event_time)
    ) WITH CLUSTERING ORDER BY (event_time DESC);
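The per-day query this model is meant to serve would then look something like this (with made-up example values):

    SELECT event_time
    FROM data
    WHERE area_id = 1
      AND sensor = 'sensorA'
      AND date = '20130101';

Since the whole partition key is specified, this should be a single-partition read, which is why I expect it to be fast.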
This doesn't, however, actually include any of the reading data, and I'm not sure how to add it to the model. Each reading (from the same sensor) can have a different set of arbitrary data, and I won't know ahead of time what this will be. E.g. I could get temperature data, I could get humidity, I could get both, or I could get something I haven't seen before. It's up to the person who actually recorded the data to decide what they want to submit (the readings aren't coming from automated sensors).
Given that I want to run query operations on this data (which is basically UGC), what are my options? Queries will normally be counts on the data (e.g. count the readings from sensor A on date B where some_ugc_valueX = C and some_ugc_valueY = D). It is worth noting that there will be more data points than would normally be queried at once: a reading could have 20 data values, but maybe only 2 or 3 would be queried; it's just unknown which ones ahead of time.
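Concretely, the kind of query I'd like to be able to run is something like the following (temperature and humidity here are hypothetical user-submitted values, not columns that exist in the table as defined above; that gap is exactly my problem):

    SELECT COUNT(*)
    FROM data
    WHERE area_id = 1
      AND sensor = 'sensorA'
      AND date = '20130101'
      AND temperature = '20'
      AND humidity = '60';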
Currently I have thought of:
- Storing the data for each sensor reading as a map type. This would certainly keep the model simple, but my understanding is that querying it would then be difficult? I think I would need to pull the entire map back for each sensor reading, then check the values and do the counting outside of Cassandra in Storm/Hadoop/whatever.
- Storing each of the user-supplied values as another column (a composite column with the event_time timeuuid). This would mean not using CQL, as CQL doesn't support adding arbitrary new columns at insert time; the Thrift API does allow this. It would mean I could get Cassandra to do the counting itself.
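For the first option, the table would look something like this (storing all the values as text for simplicity, since I can't know their types in advance):

    CREATE TABLE data (
        area_id int,
        sensor varchar,
        date ascii,
        event_time timeuuid,
        readings map<text, text>,
        PRIMARY KEY ((area_id, sensor, date), event_time)
    ) WITH CLUSTERING ORDER BY (event_time DESC);

As far as I can tell, though, there's no way to filter on individual map entries in a CQL WHERE clause, which is why I think the counting would have to happen client-side.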
Maybe I'm going about this the wrong way? Maybe Cassandra isn't even the best choice for this kind of data?