4
votes

I work with following table:

CREATE TABLE IF NOT EXISTS lp_registry.domain (
    ownerid text,
    name1st text,
    name2nd text,
    name3rd text,
    registrar text,
    registered timestamp,
    expiration timestamp,
    updated timestamp,
    technologies list<text>,
    techversions list<text>,
    ssl boolean,
    PRIMARY KEY (
        (name1st, name2nd, name3rd), 
        registrar, ownerid, registered, expiration, updated
    )
);

Table isn't updated only new rows are added. Everytime crawler checks domain, new row is added.

I am performing this select:

SELECT * FROM lp_registry.domain WHERE 
    registrar = 'REG-WEDOS' AND 
    ownerid = 'FORPSI-JAF-S497436' 
ALLOW FILTERING;

But what I want in the result are only the rows with latest 'updated' value for each unique "name3rd.name2nd.name1st".

If I were in a standard SQL database, I would use nested select with MAX or GROUP BY. However, this is not supported by Cassandra (MAX(), DISTINCT and group by in Cassandra). But what I should do in CQL?

2

2 Answers

3
votes

Extending onto Cedric's answer (which is great advice and would consider that as the answer to accept) you would get a table structure roughly like:

CREATE TABLE IF NOT EXISTS lp_registry.domain (
    ownerid text,
    name1st text,
    name2nd text,
    name3rd text,
    registrar text,
    registered timestamp,
    expiration timestamp,
    updated timestamp,
    technologies list<text>,
    techversions list<text>,
    ssl boolean,
    PRIMARY KEY ((registrar, ownerid), updated, name1st, name2nd, name3rd)
) WITH CLUSTERING ORDER BY (updated desc);

When selecting data it will return rows with the most recent updated values within the partition for the registrar and ownerid you are querying.

This query would be incredibly fast because your data will be organized on disk by registrar, owner id with rows in order by updated descending.

This is a key concept with cassandra in that your data is organized based on how you query it. You lose flexibility in your queries, but you can feel comfortable that you are going to get great performance because you are retrieving data as it is organized. This is why denormalizing your data based on your queries is vital.

Where things become complicated is if you wanted to retrieve the most recently updated of all data. That problem is not easily solvable with cassandra unless everything shares the same partition which has its own set of problems (example strategy using a 'dummy' partition key).

3
votes

The whole schema should be modified. The SELECT you're doing, apparently an important one from your application point of view should not require ALLOW FILTERING: you should denormalize your data and create a table where registrar and ownerid are partition keys.

In that denormalized structure, updated should be a partition key, sorted with DESC. The query would then be

SELECT * FROM lp_registry.domain WHERE registrar='XXX' AND ownerid='YYY' LIMIT 10;

As you're saying that rows are inserted but never updated, it should not be complicated in your application to insert new data in, if necessary, more than one denormalized tables.

Andy's answer provides more details and an example for your table structure.