2
votes

I'm new to NoSQL and Cassandra and would like some input on my decision in choosing the schema design for my database for handling hydrological data. As a side note I'm developing the application in Python and have been toying around with the Datastax Python Driver.

In short, the incoming raw data is stored as comma-separated values in different files depending on the measuring interval/type of reading, basically depending on the variable I call sensor_id. For instance, the hourly data looks like this:

sensor_id (int), year (int), julianday (int), hourminute (int) , data1 (float), data2 (float), data3 (float)..

..so a sample reading would look something like:

715, 2015, 15, 2230, 3.65, 6.12, 95.2 ,...
715, 2015, 15, 2330, 4.12, 5.12, 87.2 ,...

where 715 indicates that it's an hourly reading. A daily reading would use another sensor_id, say 716, and a sample reading would look something along these lines:

716, 2015, 15, 3.52, 5.23, 84.5,..

For the daily reading, we're obviously skipping the time details since the reading is only gathered once per Julian day. You probably get the point.

So each location has its own reading intervals, and the number of data columns, i.e. the parameters of interest, depends on the interval/type of reading. For instance, the daily data consists of ~20 columns and the hourly of ~15.

Queries of interest:

I want to be able to run a simple quality control on some of the data parameters, probably mainly air temperature, water temperature, wind speed and water level, and maybe a few others as well. To begin with, this will probably be to check whether the parameter value exceeds or falls below a given max-min threshold. I also want to be able to plot the raw data, the quality-controlled data, and the values that didn't pass the QC test.

  • Is a particular value within a given threshold (<,>)?
  • Get all parameters at a given interval (sensor_id) and time (mainly used for plotting)
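To give an idea, the threshold check itself would be something simple along these lines (the parameter names and limits below are just made-up examples):

# Example min/max limits per parameter -- names and numbers are placeholders.
QC_LIMITS = {'air_temp': (-40.0, 45.0), 'water_level': (0.0, 10.0)}

def qc_split(rows):
    """Split fetched rows into those that pass and those that fail the min/max check."""
    passed, failed = [], []
    for row in rows:
        ok = all(lo <= getattr(row, name) <= hi
                 for name, (lo, hi) in QC_LIMITS.items())
        (passed if ok else failed).append(row)
    return passed, failed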

My initial thought was to just convert the date/time parameters to a timestamp and insert each row/reading into a table with the same structure as the raw file. For instance, location_hourly would look like this (using the values above):

+-----------+---------------------+-------+-------+-------+--------+
| sensor_id |      timestamp      | data1 | data2 | data3 | data n |
+-----------+---------------------+-------+-------+-------+--------+
|       715 | 2015-01-15 22:30:00 | 3.65  | 6.12  | 95.2  |        |
|       715 | 2015-01-15 23:30:00 | 4.12  | 5.12  | 87.2  |        |
+-----------+---------------------+-------+-------+-------+--------+    

and set the primary key to (sensor_id, timestamp), in which case sensor_id would become the partition key and timestamp the clustering key. This would obviously work in the cases where we want to fetch all values taken at a particular time, but not for the range queries we're interested in when performing the QC.
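For reference, this is roughly what I had in mind with the DataStax Python driver (the keyspace name and contact point are just placeholders):

from datetime import datetime, timedelta

from cassandra.cluster import Cluster

# Placeholder contact point and keyspace name.
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('hydro')

session.execute("""
    CREATE TABLE IF NOT EXISTS location_hourly (
        sensor_id int,
        timestamp timestamp,
        data1 float,
        data2 float,
        data3 float,
        PRIMARY KEY (sensor_id, timestamp)
    )
""")

def to_timestamp(year, julianday, hourminute):
    """Convert year / Julian day / HHMM into a datetime."""
    return (datetime(year, 1, 1)
            + timedelta(days=julianday - 1,
                        hours=hourminute // 100,
                        minutes=hourminute % 100))

insert = session.prepare(
    "INSERT INTO location_hourly (sensor_id, timestamp, data1, data2, data3) "
    "VALUES (?, ?, ?, ?, ?)")

# First sample row from above: 715, 2015, 15, 2230, 3.65, 6.12, 95.2
session.execute(insert, (715, to_timestamp(2015, 15, 2230), 3.65, 6.12, 95.2))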

This is where I'm at, and I'd love to hear your views on this.


1 Answer

0
votes

I'm quite new to Cassandra, but I think I can share some knowledge here.

To enable range queries on weather parameters, you'll need to create separate tables where those weather parameters act as clustering columns. For example, to be able to query on a wind speed range:

CREATE TABLE location_by_wind_speed (sensor_id bigint, wind_speed int,
    date timestamp, data1 float, /* ...remaining data columns... */
    PRIMARY KEY (sensor_id, wind_speed) );

Then you'll be able to query like this:

SELECT * FROM location_by_wind_speed WHERE sensor_id=123 
  AND wind_speed > 20 AND wind_speed < 100

To be able to filter results by time, you can use a composite partition key (described here). For example:

CREATE TABLE location_by_wind_speed_year (sensor_id bigint, year int, wind_speed int,
    date timestamp, data1 float, /* ...remaining data columns... */
    PRIMARY KEY ((sensor_id, year), wind_speed) );

This will allow you to select recordings within a particular year. If you need results sorted by time, you can sort in your app's code.
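For instance, a rough sketch with the Python driver (using placeholder values, and assuming an existing session object) could look like this:

# Fetch a wind-speed range for one sensor and year.
rows = session.execute(
    "SELECT * FROM location_by_wind_speed_year "
    "WHERE sensor_id = %s AND year = %s AND wind_speed > %s AND wind_speed < %s",
    (123, 2015, 20, 100))

# Cassandra returns rows ordered by wind_speed, so sort by time on the client side.
rows_by_time = sorted(rows, key=lambda row: row.date)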