0
votes

Let's say you have a simple Cassandra schema (using CQL3) that stores the readings of various sensors once per second (a time series).

create table sensor_readings (
    sensorid varchar,
    time timestamp,
    value varchar,
    primary key (sensorid, time)
);

Is there an efficient way to retrieve the data at a different time periodicity than was recorded?

For instance, how would you retrieve only hourly data or daily data when all the data is stored on a per second basis?

Do you have to retrieve all the data into your application, and then have your application filter out the data? Or can Cassandra do this for you?

4 Answers

0
votes

No. You could call this a limitation of Cassandra: there is no date function like those in an RDBMS, so you cannot directly retrieve and filter the data by time period on the server.

As you mention, you have to retrieve all the data and then filter it in your application, using JavaScript or whatever language you work in.
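A minimal sketch of that application-side filtering, assuming the rows for one sensor have already been fetched as (time, value) pairs in time order. The function name `first_reading_per_hour` is hypothetical, and "keep the first reading of each hour" is just one possible way to thin the data out.

```python
from datetime import datetime

def first_reading_per_hour(rows):
    """Keep only the first reading seen in each hour.

    `rows` is assumed to be an iterable of (time, value) pairs for a
    single sensorid, ordered by time (the clustering order of the
    sensor_readings table).
    """
    seen_hours = set()
    hourly = []
    for time, value in rows:
        # Truncate the timestamp to the start of its hour.
        hour = time.replace(minute=0, second=0, microsecond=0)
        if hour not in seen_hours:
            seen_hours.add(hour)
            hourly.append((hour, value))
    return hourly
```

The same idea works for daily data by also zeroing the hour when truncating.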

0
votes

Cassandra does not do any aggregation on your behalf. So if you need to group data with a specific time granularity, you'll need to compact your data points into an appropriately sized time window bucket at insertion time. This is a common pattern when dealing with time series data in Cassandra. See this article for an example.

Alternatively, you could aggregate at query time in your app, but you will be streaming lots of data out of Cassandra. You'll most likely reduce the overall IO load if you can do this at insertion time.
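As a rough illustration of bucketing at insertion time, the sketch below keeps a running count and sum per (sensorid, hour) bucket as readings arrive. Everything here is an assumption for illustration: the class name `HourlyRollup`, the choice of an average as the aggregate, and parsing the varchar value as a float. In a real system the bucket would be written to a Cassandra table rather than held in memory.

```python
from collections import defaultdict
from datetime import datetime

def hour_bucket(ts):
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)

class HourlyRollup:
    """In-memory stand-in for an hourly aggregate table (hypothetical)."""

    def __init__(self):
        # (sensorid, hour) -> [count, total]
        self._buckets = defaultdict(lambda: [0, 0.0])

    def add(self, sensorid, ts, value):
        """Fold one reading into its hourly bucket at insertion time."""
        bucket = self._buckets[(sensorid, hour_bucket(ts))]
        bucket[0] += 1
        bucket[1] += float(value)  # assumes the varchar value is numeric

    def averages(self):
        """Return the average reading per (sensorid, hour) bucket."""
        return {
            key: total / count
            for key, (count, total) in self._buckets.items()
        }
```

Reading hourly data then touches one small bucket per hour instead of 3600 raw rows.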

0
votes

As Cassandra does not support such a feature, you have two options. You can either calculate the aggregated data up front, while the raw data is being persisted (and persist the aggregate as its own entity), or calculate it later, after the raw data has been persisted. The second option is usually called "on demand".

When should you use the first option and when the second?

The first is obviously more difficult and takes more resources. The point is that if you always need those stats (aggregations) and they are a crucial part of your app, you should calculate them as the data is persisted. If the aggregated data is used much less often, you can just calculate it on demand when a user asks for it.

As I describe the problem here, the two approaches resemble the classic difference between the RDBMS indexing approach and the indexes that are usually missing in NoSQL. An RDBMS always maintains its indexes, at the cost of some performance penalty and potential locking problems. With NoSQL, on the other hand, you are usually required to maintain indexes on your own.

0
votes

The classic Cassandra way is to design additional tables that would support your queries. This is a concise paper describing the process.

So, say for the hourly data you'll have another table like

create table sensors_by_hour (
    hour text,
    sensorid varchar,
    value varchar,
    primary key (hour, sensorid)
);

and you'll write to both tables at the same time, truncating the timestamp to the hour for the second one. This will give you the sensors and their values for a particular hour.

Etc.
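A sketch of the dual write described in this answer, assuming the hour key is stored as text in the form "YYYY-MM-DD HH:00" (the answer does not specify a format). The helper names `hour_key` and `dual_write_statements` are hypothetical. The function only builds the statements and their parameters, so it runs without a cluster.

```python
from datetime import datetime

INSERT_READING = (
    "INSERT INTO sensor_readings (sensorid, time, value) VALUES (%s, %s, %s)"
)
INSERT_HOURLY = (
    "INSERT INTO sensors_by_hour (hour, sensorid, value) VALUES (%s, %s, %s)"
)

def hour_key(ts):
    """Cut the timestamp at hour level (assumed text format)."""
    return ts.strftime("%Y-%m-%d %H:00")

def dual_write_statements(sensorid, ts, value):
    """Return (statement, parameters) pairs for both tables."""
    return [
        (INSERT_READING, (sensorid, ts, value)),
        (INSERT_HOURLY, (hour_key(ts), sensorid, value)),
    ]
```

With the DataStax Python driver, each pair could presumably be passed to `session.execute(statement, parameters)`. Note that with the primary key (hour, sensorid) each sensor has a single row per hour, so a later write in the same hour overwrites the earlier one; that table ends up holding the last reading of each hour.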