Issue with cassandra data modelling

Question

We want to use Apache Cassandra for storing big data gathered from realm time sensor data. We have developed an IOT platform capable to handle 1 million events per second. We want to persist them in Cassandra.

Our table looks like :

Sensor_data_by_date
Realm   text    K
Bucket  int K
dateTimeReceived    timestamp   Clustering column
sensor_id   text    
Message_id  text    
Sensor_name text

Query we are interested in is :

Give me all results for all sensor data for “realm-a” for a dateTime range say “5th may” to “12th may” order by “dateTimeReceived”.

Solution : Since our platform can handle upto 1 million events per second, when I even try to include DATE + HOUR as partition key, it will still increase the maximum recommended size by Casssandra. So we decided to keep bucket along with realm as partition key.

Problem : Now say when we have a wide range of date range as mentioned (5th may to 13th may), we will have multiple buckets to lookup from. We also need to support ordering. When we have this in place, I need to use “IN clause” for buckets say :

where realm=realm-a and bucket in (1,2,3,4) 
and dateTimeReceived>… 
and dateTimeReceived <… order by dateTimeReceived

This would complain that IN clause and order by cant work together with pagination. I need to have pagination as well…..

Can you please help me how to achieve this functionality?

Help will be much appreciated.

xmas79 xmas79 · Accepted Answer · 2017-05-15T09:47:29

You don't need (well, you don't have to...) specify an IN clause. If you fill the buckets by time-of-arrival of your events, that is bucket X is always older than bucket X-1 (or if you store data in reverse order bucket X is always newer than bucket X-1) you already know the order of your bucket. Hence the only thing left is to query each partition (in parallel, no IN) and sort by bucket. Inside each bucket Cassandra can sort by timestamp. The last step is to process these events in bucket order (or reverse bucket order).

About your DATA + HOUR problem, you can use your bucketing technique, or you could go down to DATA + HOUR + QUARTER OF HOUR. Here's some suggestions I gave to similar problems:

HTH.

Issue with cassandra data modelling

1 Answers