1
votes

I'm trying to model a table in Cassandra, I'm quite new and stumbled upon one problem. I've got the following:

CREATE TABLE content_registry (
    service text,
    file text,
    type_id tinyint,
    container text,
    status_id tinyint,
    source_location text,
    expiry_date timestamp,
    modify_date timestamp,
    create_date timestamp,
    to_overwrite boolean,
    PRIMARY KEY ((service), file, type_id)
);

So as I understand:

  • service is my partition key and based on this value hashes will be generated and values will be split in cluster
  • file is clustering key
  • type_id is clustering key
  • These three bodies combine a composite (compound) primary key

What I've figured out is that whenever I'll insert new data, Cassandra will upsert (either insert or update if the value with that compound primary key exists)

Now what I'm struggling is, that I want my data to come back sorted by create_date in descending order, however create_date is not part of primary key.

If I add create_date to my primary key, I won't be able to upsert data, because create_date means timestamp when record was inserted, so if I add it to primary key every time there's an insert, I'll end up with multiple records.

What are the other options? Order in application? That doesn't seem very efficient.

3

3 Answers

2
votes

What I've figured out is that whenever I'll insert new data, Cassandra will upsert (either insert or update if the value with that compound primary key exists)

Totally right.

Now what I'm struggling is, that I want my data to come back sorted by create_date in descending order, however create_date is not part of primary key. If I add create_date to my primary key, I won't be able to upsert data, because create_date means timestamp when record was inserted, so if I add it to primary key every time there's an insert, I'll end up with multiple records.

With these sentences you are actually contradicting.

If create_date isn't part of your key but a property and the data is upserted, it means that the records are always the same. Therefore when querying by the key and fetching create_date you always have the latest. If you actually want to have the date when the record got created you should just not override the data anymore after the first time you inserted that record.

If it's the case you want to represent a series of data, you indeed need to avoid upserting, this is could be done by using create_date as additional partition key. I'd rather prefeer using time_uuid which comes with quite handy functions.

Last but not least, the most interesting question is, what actually the usecase is that you want to reflect. When modelling data in cassandra you always should know your queries you need to run in advance.

2
votes

The key concept in Cassandra is that you have to decide what's your PRIMARY KEY, that is what in your rows can be unique and known at query times. This is a very basic requirement, since failing at recognizing this will lead to a bad model.

From what I can see, you identified service as your PARTITION KEY, so I'm thinking that this field is what "rules" your data. This is something you must really know to perform even a single query (ignoring the inefficient table scan SELECT * FROM content_registry;). Within each service, you currently have your rows ordered by file and then by type_id. I don't know the exact meaning of the latter field, but you can currently have two rows identified by ('service1', 'a.jpg', 1) and ('service1', 'a.jpg', 2). So if type_id is somehow related to the file, the model is a bit incorrect.

Now, assuming you want to fetch the same records for each service in another order, what you really need to do is create another table that will include the create_date as the first clustering column, eg (service, create_date, file, type_id). This will allow you to fetch records ordered by creation date, and when two records are created in the same date, they will be further ordered by file, and then by type_id.

A second approach is to attach a secondary index to the create_date field of your original table. This will allow to query by creation date.

A third approach, probably better than the second, is the use of a Materialized View. It will hide a lot of burdens for you and will probably scale better than secondary indexes.

Please note that having secondary indexes or materialized views usually don't scale well. Check if these approaches are enough for your use case.

0
votes

If I add create_date to my primary key, I won't be able to upsert data.

Why not? Suppose your key was PRIMAY KEY (service, create_date, file, type_id)? That will let you sort by create_date for each service but not globally.

If you want to do it globally (that is, you want all services and all files sorted by create date) then things are probably more complex if you still want to be able to shard your data. One option would be to make the primary key PRIMARY KEY (create_date, service, file, type_id) and use one of the order preserving partitioners.

Also, a bit more information here: http://www.datastax.com/dev/blog/we-shall-have-order