1
votes

With Cassandra it is possible to specify the clustering order of a table on a particular column.

CREATE TABLE myTable (
    user_id INT,
    message TEXT,
    modified DATE,
    PRIMARY KEY ((user_id), modified)
)
WITH CLUSTERING ORDER BY (modified DESC);

Note: In this example, there is one message per user_id (intended)

Given this table, my understanding is that queries will perform better when recent data is requested.
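
To illustrate the kind of read this order favors, here is a minimal sketch (the user_id value 1 is a made-up placeholder):

```cql
-- Rows inside the partition are stored newest-first, so the first row
-- returned is the most recent one; no ORDER BY is needed.
SELECT message, modified
FROM myTable
WHERE user_id = 1
LIMIT 1;
```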

However, if one were to make updates to the "modified" column, does it add extra overhead on the server to "re-order" the data, and is that overhead significant compared with the gain in query performance?

In other words, given this table, would it perform better if the "CLUSTERING ORDER BY (modified DESC)" was dropped?

UPDATE: Fixed the invalid CQL by adding modified to the primary key. The original questions still stand.

3
Hint: That table definition is not valid CQL. – Aaron
I compared it with the table definition here cassandra.apache.org/doc/latest/cql/ddl.html#create-table and it seems valid. Can you please point out the problem? – Exocomp
The issue is that modified is not defined as the clustering key, so you can't define a clustering order on it. To fix this, the primary key should be defined as PRIMARY KEY (user_id, modified). For more information about the composite key and the characteristics of the clustering key, see stackoverflow.com/questions/24949676/… – Carlos Monroy Nieblas
My mistake, it was an oversight. I had the date in the primary key on the server, just not in my example here. – Exocomp
Also note that if you set the clustering order on modified, you won't be able to update that column on an existing record (as explained in stackoverflow.com/questions/27075596/…). Cassandra is an append-only database engine: any update to a record writes a new version with a newer timestamp, and a select returns the version with the latest timestamp. This means that there is never a "re-order" operation. – Carlos Monroy Nieblas

3 Answers

1
votes

In order to make modified a clustering column, it needs to be defined in the primary key.

CREATE TABLE myTable (
    user_id INT,
    message TEXT,
    modified DATE,
    PRIMARY KEY ((user_id), modified)
)
WITH CLUSTERING ORDER BY (modified DESC);

This way, your data will be sorted primarily by the hashed value of user_id, and within each user_id by modified. You don't need to drop the "WITH CLUSTERING ORDER BY (modified DESC)".

0
votes

Moving my comment into an answer, in reply to the updated question:

if one were to make updates to the "modified" column does it add extra overhead on the server to "re-order" and is that overhead vs query performance significant?

If modified is defined as part of the clustering key, you won't be able to update its value on an existing record, but you will be able to add as many records as needed, each with a different modified date.

Cassandra is an append-only database engine: any update to a record writes a new version with a newer timestamp, and a select returns the version with the latest timestamp. This means that there is no "re-order" operation.
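
A minimal sketch of what this means in practice, assuming the table from the question and made-up values. Since modified is part of the primary key, it cannot appear in a SET clause, so a "change" of modified has to be written as a delete of the old row plus an insert of the new one:

```cql
-- Rejected as an invalid query: primary key columns cannot be SET.
-- UPDATE myTable SET modified = '2017-06-02'
-- WHERE user_id = 1 AND modified = '2017-06-01';

-- Non-key columns can be updated; this writes a new version of message.
UPDATE myTable SET message = 'edited text'
WHERE user_id = 1 AND modified = '2017-06-01';

-- To move the row to a new modified date, remove the old row and
-- write a new one. Both statements target the same partition.
BEGIN BATCH
    DELETE FROM myTable WHERE user_id = 1 AND modified = '2017-06-01';
    INSERT INTO myTable (user_id, message, modified)
    VALUES (1, 'edited text', '2017-06-02');
APPLY BATCH;
```

The delete is itself a write (a tombstone) rather than an in-place removal, so nothing is re-sorted at write time. If only one message per user_id should survive, this delete-plus-insert pattern is one way to keep the partition from accumulating old rows.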

Whether to keep or drop the clustering order should be decided based on how the data will be queried. If you are only going to read the latest records for a given user_id, it makes sense to keep the clustering order as you defined it.
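
As a rough illustration with placeholder values: the DESC order serves the "newest first" reads directly, and an occasional "oldest first" read can still reverse it at query time.

```cql
-- Rows since a given date, returned in the stored (DESC) order.
SELECT message, modified FROM myTable
WHERE user_id = 1 AND modified >= '2017-01-01';

-- Same partition, oldest first: the reverse of the clustering order.
SELECT message, modified FROM myTable
WHERE user_id = 1
ORDER BY modified ASC;
```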

0
votes

In your data model, user_id is the partition key (also called row key or shard key), which is important for data locality, and the clustering column (modified) specifies the order in which the data is arranged inside the partition. The combination of these two keys makes up the primary key.

Even in the RDBMS world, updating a primary key is avoided for the sake of data integrity.

However, in Cassandra there are no constraints or relations between column families/tables. Writing exactly the same values to the primary key fields (user_id, modified) updates the existing record; otherwise a new row is added.
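
A small sketch of this upsert behavior, using the table from the question and made-up values:

```cql
INSERT INTO myTable (user_id, message, modified)
VALUES (1, 'first', '2017-06-01');

-- Same (user_id, modified): overwrites message on the existing row.
INSERT INTO myTable (user_id, message, modified)
VALUES (1, 'second', '2017-06-01');

-- Different modified: adds a second row to the same partition.
INSERT INTO myTable (user_id, message, modified)
VALUES (1, 'third', '2017-06-02');
```

After these three statements, the partition for user_id 1 holds two rows, not three.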

Reference: https://www.datastax.com/dev/blog/we-shall-have-order