How do I group by a field in a certain time range in Cassandra?

Question

I have a big Cassandra table with around 15 fields, and I wanted to create many materialized views to support all of my queries. The Primary Key on the base table is (CompanyName, ctime) where ctime is a timeuuid. The table records every click on a certain company website at a certain timestamp.

For instance, one of my queries includes searching what are the most used browsers on a certain time range (max. one year).

So I did

CREATE MATERIALIZED VIEW clicks_by_browser 
    AS SELECT CompanyName, ctime, browsername
    FROM companyclicks
    WHERE CompanyName is not null AND ctime is not null AND browsername is not null
    PRIMARY KEY(CompanyName, ctime, browsername)

But this does not work. When I do the query:

SELECT browsername, count(*) from clicks_by_browser
    WHERE CompanyName='example' and id>=minTimeuuid(...) 
    GROUP BY browsername;

Cassandra rejects it because "Group by only support groups of columns following their declared order in the primary key". So the issue is I declared ctime before browsername in the primary key.

So I tried inverting their order as clustering columns,

CREATE MATERIALIZED VIEW clicks_by_browser 
    AS SELECT CompanyName, ctime, browsername
    FROM companyclicks
    WHERE CompanyName is not null AND ctime is not null AND browsername is not null
    PRIMARY KEY(CompanyName, browsername, ctime)

But now Cassandra rejects the same query because ctime can't be restricted in the WHERE clause as preceding column browsername is not restricted.

So is such a query just not possible in Cassandra currently? Or am I missing something obvious?

Abhishek Garg Abhishek Garg · Accepted Answer · 2019-07-18T18:14:55

TLDR: It is not possible.

Long Answer:
As per the Cassandra JIRA, Cassandra currently doesn't implement group by on arbitrary columns. When grouping by multiple columns, you can only group by primary key columns in the order they are declared in the primary key, starting from the first.

The only exception allowed is if you have restricted the first n columns with an equality restriction. Note, restricting a column with an "=" is selecting one group, so there is nothing to group in that column, that's why Cassandra allows this. You can then group by the remaining columns, in order and starting from the next column (you can’t skip columns in between).

So to group by a column, all preceding primary key columns have to be either restricted by "=" or be in the group by clause.

The range query like the one above fails because the sample column is being restricted by a range, thus still having multiple groups, but is not in the group by clause. LIKE and IN restrictions will also not work in this case.

I think the best thing you can try is including timeuuid in group by and then doing the aggregate in the application.

SELECT browsername,id, count(*) from clicks_by_browser
    WHERE CompanyName='example' and id>=minTimeuuid(...) 
    GROUP BY id,browsername;

How do I group by a field in a certain time range in Cassandra?

1 Answers