1
votes

I'm new in Cassandra and I'm trying to learn a bit more of how this DB engine works (specially the CQL part ) and compare it with Mysql.

With this in mind I was trying some query's, but there is one particular query that I can't figure out. From what I could read it seams that it's not possible to do this query in Cassandra, but I would like to know for sure if there is somework around that.

Imagine the following table [Customer] with PRIMARY_KEY = id:

id, name, city, country, email 
01, Jhon, NY, USA, jhon@
02, Mary, DC, USA, mary@
03, Smith, L, UK, smith@
.....

I want to get a listing that shows me how many customers I have per country and ORDER BY DESC.

In mySQL it would be something like

SELECT COUNT(Id), country 
FROM customer
GROUP BY country
ORDER BY COUNT(Id) DESC

But in Cassandra (CQL) it seems that I can't do GROUP BY of columns that aren't PRIMARY_KEY (like the case of "country" ), is there anyway arround this ???

2
Although CQL resembles the SQL, it's not the same... To perform the things like aggregations, sorting, etc., you need to model your table correct way. I recommend to take DS220 course about data modelling in Cassandra: academy.datastax.com/resources/ds220Alex Ott

2 Answers

0
votes

You need to define a secondary index on "country". Secondary indexes are used to query a table using a column that is not normally query table.

For ORDER BY you define clustering keys on 'id'.Clustering keys are responsible for sorting data within a partition.

0
votes

The main thing to remember when building a table in Cassandra, is to model its PRIMARY KEY based on how you plan to query it. In any case, defining id as the PRIMARY KEY isn't very helpful for what you're trying to do.

Also, keywords like GROUP BY and ORDER BY have special requirements. ORDER BY specifically is pretty useless (IMO), unless you plan to reverse the sort direction. But you cannot pick an arbitrary column to sort your data by.

To solve for your query above, I'll create a new table, keyed on the country, city, and id columns (in that order):

CREATE TABLE customer_by_city (
  id TEXT,
  name TEXT,
  city TEXT,
  country TEXT,
  email TEXT,
  PRIMARY KEY (country,city,id)
) WITH CLUSTERING ORDER BY (city ASC, id DESC);

Now, I'll INSERT the rows:

INSERT INTO customer_by_city (id,name,city,country,email)
     VALUES ('01', 'Jhon', 'NY', 'USA', '[email protected]');
INSERT INTO customer_by_city (id,name,city,country,email)
     VALUES ('02', 'Mary', 'DC', 'USA', '[email protected]');
INSERT INTO customer_by_city (id,name,city,country,email)
     VALUES ('03', 'Smith', 'London', 'UK', '[email protected]');

SELECT COUNT(Id), country  FROM customer_by_city  GROUP BY country ;

 system.count(id) | country
------------------+---------
                2 |     USA
                1 |      UK

(2 rows)

Warnings :
Aggregation query used without partition key

Notes:

  • That last message means you're running a query without a WHERE clause keyed by the partition key. That means Cassandra is going to have to check every node in the cluster to serve this query. Highly inefficient.
  • While it works for this example, country as a partition key may not be the best way to distribute data. After all, if most of the customers are in one particular country, then they could potentially push the bounds of the maximum partition size.