
I have the following table to store time-series data:

CREATE TABLE alerts_by_year_day (
    day_of_year int,
    year int,
    alert_timestamp timestamp,
    serial_number text,
    alert_id uuid,
    alert_type text,
    ....
    ....
  PRIMARY KEY((year, day_of_year), alert_timestamp, serial_number, alert_id)
) WITH CLUSTERING ORDER BY (alert_timestamp DESC, serial_number DESC);
  1. For the UI report, I want to retrieve all the alerts for a given period of time. I have this query:

select * from alerts_by_year_day where year = 2015 and day_of_year in (241, 240);

But this query returns results in ascending order of year and then ascending order of day, like this:

2015 | 240 | .....
2015 | 241 | .....

But I want to display the latest results first, i.e. in descending order. Adding 'order by alert_timestamp' gives an error. So how can I display the results in descending order?

  2. Then for a given period of time, I want to retrieve only certain alert types based on alert_type.

So I created a materialized view like this:

CREATE MATERIALIZED VIEW alerts_by_type_and_timestamp AS 
    SELECT *
    FROM alerts_by_year_day
    WHERE alert_timestamp IS NOT NULL AND
        alert_type IS NOT NULL AND 
        day_of_year IS NOT NULL AND 
        year IS NOT NULL AND serial_number IS NOT NULL AND 
        alert_id IS NOT NULL
    PRIMARY KEY ((year, day_of_year, alert_type), alert_timestamp, serial_number, alert_id)
    WITH CLUSTERING ORDER BY (alert_timestamp DESC, serial_number DESC, alert_id DESC);

But of course it returns results by type first, and only within each type by timestamp. What I am looking for is only a subset of types, in the descending order they were generated in. Is that possible in Cassandra?

Thanks


1 Answer


The partitions come back in token order, i.e. the order of the Murmur3 hash of the partition key. For example:

cqlsh:test> select * from alerts_by_year_day ;

 year | day_of_year | alert_timestamp                 | serial_number | alert_id                             | alert_type
------+-------------+---------------------------------+---------------+--------------------------------------+------------
 2015 |          10 | 1970-01-01 00:00:00.001000+0000 |          s123 | b7baa710-b87b-11e6-9137-eb2177fd2cc2 |       type
 2015 |         110 | 1970-01-01 00:00:00.001000+0000 |          s123 | bf110270-b87b-11e6-9137-eb2177fd2cc2 |       type
 2015 |          11 | 1970-01-01 00:00:00.001000+0000 |          s123 | bce08de1-b87b-11e6-9137-eb2177fd2cc2 |       type
 2016 |         110 | 1970-01-01 00:00:00.001000+0000 |          s123 | c2e22eb1-b87b-11e6-9137-eb2177fd2cc2 |       type

Your IN query walks them in that order (which you can't control). Under the covers, Cassandra has to make a separate query for each partition key combination.
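You can see the tokens themselves by selecting them. A quick sketch with the DataStax Python driver (my assumption; `session` is taken to be an already-connected Session, and the same SELECT works in cqlsh):

# Print the Murmur3 token each (year, day_of_year) partition hashes to;
# the "unordered"-looking output above follows this token order.
rows = session.execute(
    'SELECT token(year, day_of_year) AS t, year, day_of_year FROM alerts_by_year_day')
for row in rows:
    print(row.t, row.year, row.day_of_year)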

This requires a separate fetch per value in the IN clause and can get inefficient fast if you put too much in it, since it places a lot of burden on your coordinator. It costs almost the same to just make two async SELECT queries, which you can then read in whatever order you want. This also saves a single coordinator in your cluster from having to manage fetches to many nodes, which ultimately helps cluster health. One query per day isn't bad at all to iterate through in your application.
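For example, a rough sketch with the DataStax Python driver (again assuming a connected `session`):

# One async query per day, newest day first.
futures = [
    session.execute_async(
        'SELECT * FROM alerts_by_year_day WHERE year = %s AND day_of_year = %s',
        (2015, day))
    for day in (241, 240)]

# Within each day the rows already come back newest-first thanks to the
# CLUSTERING ORDER BY (alert_timestamp DESC), so reading the day-241 future
# before the day-240 one gives everything in descending order.
for future in futures:
    for row in future.result():
        print(row.alert_timestamp, row.alert_type)

That ordering behaviour also answers the descending-order part of your first question.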

If the days are not "every day", you may want to consider a second table that is just (year, day_of_year), and another that is (type, year, day_of_year), which you write to whenever you do your insert. Before you run your main query you can then look up which days (or types) actually have data.

Note: you can keep a local in-memory cache so you don't make thousands of unnecessary writes. You only need to write each entry once, but it is OK to write it multiple times in case there are multiple instances of the app or restarts.
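A minimal sketch of what that index table and its write path could look like, reusing the alert_day_index name from the query below; the table layout and the record_day helper are just illustrations, and `session` is assumed to be a connected Session:

# Illustration only: an index of which (year, day) partitions actually contain alerts.
session.execute("""
    CREATE TABLE IF NOT EXISTS alert_day_index (
        year int,
        day_of_year int,
        PRIMARY KEY (year, day_of_year)
    )""")

written_days = set()  # local in-memory cache; duplicate writes are harmless anyway

def record_day(year, day_of_year):
    # Call this alongside every alert insert.
    if (year, day_of_year) not in written_days:
        session.execute(
            'INSERT INTO alert_day_index (year, day_of_year) VALUES (%s, %s)',
            (year, day_of_year))
        written_days.add((year, day_of_year))

Then at read time: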

# assuming `session` is an already-connected cassandra-driver Session
year = 2015
days = session.execute('SELECT * FROM alert_day_index WHERE year = %s', (year,))
results = []
for day in days:
    results.extend(session.execute(
        'SELECT * FROM alerts_by_year_day WHERE year = %s AND day_of_year = %s',
        (year, day.day_of_year)))

If you have a lot of days, you just need to make the queries async so that the latency of the individual queries doesn't block the throughput of your app.
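The same loop with async queries might look something like this (again just a sketch, not a drop-in implementation):

# Fire all per-day queries at once, then collect the results as they complete.
futures = [
    session.execute_async(
        'SELECT * FROM alerts_by_year_day WHERE year = %s AND day_of_year = %s',
        (year, day.day_of_year))
    for day in days]
results = []
for future in futures:
    results.extend(future.result())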