MAX(), DISTINCT and group by in Cassandra

18

votes

I am trying to remodel a SQL database Cassandra such that, I can find the Cassandra equivalent for the SQL queries. I use CQL 3 and Cassandra v1.2. I modeled the db design in cassandra so that it supports the order by clauses and denormalized tables to support the join operation. However I am at sea when it comes to DISTINCT, SUM() and GROUPBY equvalents

SELECT a1,MAX(b1) FROM demo1 group by a1.
SELECT DISTINCT (a2) FROM demo2 where b2='sea'
SELECT sum(a3), sum(b3) from demo3 where c3='water' and d3='ocean'

This is like a showstopper to my work for past couple of days. Is there a way in Cassandra, that I can model the db schema to support queries of these kind? I cant think of any way in Cassandra . How are such queries be implemented using Cassandra?

I read that a hive layer over Cassandra can possibly make these queries work. I am just wondering if that is the only way that such queries can be supported in Cassandra..? Pls advise on any other possible methods..

nosqlcassandracqlcql3nosql-aggregation

10

votes

Cassandra doesn't support operations like this. You can use something like Hive on top or there's a (non-free) product from Acunu that may do what you need.

The other solution is to do the work yourself. For example, you can sum things by reading in all the data from certain rows and summing. Or maintain a Cassandra counter to increment on the fly.

26

votes

With Cassandra you solve these kinds of problems by doing more work when you insert your data -- which sounds like it would be slow, but Cassandra is designed for fast writes, and you're probably going to read the data many more times than you write it so it makes sense when you consider the whole system.

I can't tell you exactly how to create your tables to model your problem because it will depend a lot on the details. You need to figure a schema that lets you get the data without performing any on-the-fly aggregations. Think about how you would create views for the queries in an RDBMS, and then try to think how you would insert data directly into those views, not into the underlying tables. That's kind of how you model things in Cassandra.

22

votes

Although this is an old question, it appears in Google search results pretty high. So I wanted to give an update.

Cassandra 2.2+ supports user defined function and user defined aggregates. WARNING: this does not mean that you don't have to do data modeling anymore (as it was pointed by @Theo) rather it just allows you to slightly preprocess your data upon retrieval.

SELECT DISTINCT (a2) FROM demo2 where b2='sea'

To implement DISTINCT, you should define a function and an agreggate. I'll call both the function and the aggregate uniq rather than distinct to emphasize the fact that it is user defined.

CREATE OR REPLACE FUNCTION uniq(state set<text>, val text)
  CALLED ON NULL INPUT RETURNS set<text> LANGUAGE java
  AS 'state.add(val); return state;';
CREATE OR REPLACE AGGREGATE uniq(text)
  SFUNC uniq STYPE set<text> INITCOND {};

Then you use it as follows:

SELECT uniq(a2) FROM demo2 where b2='sea';

SELECT sum(a3), sum(b3) from demo3 where c3='water' and d3='ocean'

SUM is provided out of the box and works as you would expect. See system.sum.

SELECT a1,MAX(b1) FROM demo1 group by a1

GROUP BY is a tricky one. Actually, there is no way to group result rows by some column. But what you can do is to create a map<text, int> and to group them manually in the map. Based on an example from Christopher Batey's blog, group-by and max:

CREATE OR REPLACE FUNCTION state_group_and_max(state map<text, int>, type text, amount int)
  CALLED ON NULL INPUT
  RETURNS map<text, int>
  LANGUAGE java AS '
    Integer val = (Integer) state.get(type);
    if (val == null) val = amount; else val = Math.max(val, amount);
    state.put(type, val);
    return state;
  ' ;

CREATE OR REPLACE AGGREGATE state_group_and_max(text, int) 
  SFUNC state_group_and_max
  STYPE map<text, int> 
  INITCOND {};

Then you use it as follows:

SELECT state_group_and_max(a1, b1) FROM demo1;

Notes

As it was mentioned above, you still have to invest some time in data modeling, don't overuse these features
You have to set enable_user_defined_functions=true in your cassandra.yaml to enable the features
You can overload the functions to support grouping by columns of different types.

References:

16

votes

Cassandra 3.10 now supports Group by parition key and clustering key. You can refer to this link for more detail.

MAX(), DISTINCT and group by in Cassandra

4 Answers

Notes

References: