why clustering key is named “clustering key” in cassandra?

Question

Table-1:

create table mylistofitems (listid int, 
  itemid int, 
  quantity int, 
  itemdesc text, 
  primary key ((listid, itemid), itemdesc));

In the above table I'm doing the following inserts:

insert into mylistofitems (listid, itemid, itemdesc, quantity) values (1, 1000, 'apple', 5);
insert into mylistofitems (listid, itemid, itemdesc, quantity) values (1, 1000, 'banana', 10);
insert into mylistofitems (listid, itemid, itemdesc, quantity) values (1, 1000, 'orange', 6);
insert into mylistofitems (listid, itemid, itemdesc, quantity) values (1, 1000, 'orange', 50);

when I do select * from mylistofitems I get the following:

 listid | itemid | itemdesc | quantity
--------+--------+----------+----------
      1 |   1000 |    apple |        5
      1 |   1000 |   banana |       10
      1 |   1000 |   orange |       50

Second insert statement did NOT overwrite the 1st row. But the fourth insert statement has overwritten the third row.

In this context what is the meaning of clustering key?

Table-2:

create table myitems (listid int,
  itemid int, 
  idesc text, 
  qty int,
  primary key (listid, itemid));

I insert the following records into table-2:

insert into myitems (listid, itemid, idesc, qty) values (1, 1000,
'apple', 5);
insert into myitems (listid, itemid, idesc, qty) values (1, 1000, 'banana', 10);
insert into myitems (listid, itemid, idesc, qty) values (1, 1000, 'orange', 6);
insert into myitems (listid, itemid, idesc, qty) values (1, 1000, 'orange', 50);

Insert queries in table-2 are exactly the same as table-1. But when I do select * from myitems I'm surprised to see only one row which was inserted last. Rest all rows are lost. ie., Every insert statement overwrote the previous record.

 listid | itemid | idesc  | qty
--------+--------+--------+-----
      1 |   1000 | orange |  50

Question: Why is it behaving differently in table-2 compared to table-1? What is the meaning of clustering key in this context? Why clustering key has been named "Clustering key". Does it have anything to do with cassandra cluster?

Question on update: I tried doing an update on table-1:

update mylistofitems set quantity = 100 where listid = 1 and itemid = 1000;

This says error 2200 some clustering key are missing. Why this is restricted?

Jim Wright Jim Wright · Accepted Answer · 2018-07-05T13:14:45

What is a clustering key?

The clustering keys dictate how data is stored on disk. This is one of the reasons that Cassandra is so performant. Because the order of the columns change how the data is stored, it is important you know they are managed in Cassandra's internals.

Visualise the data on disk as an array. This is effectively how Cassandra will store your data. This is what the first table would look like after the first 3 queries:

table1 =
(listid(1) - itemid(1000)) // Partition key
    idesc('apple') // Clustering key
        = {listid: 1, itemid: 1000, idesc: apple, qty: 5}
    idesc('banana') // Clustering key
        = {listid: 1, itemid: 1000, idesc: banana, qty: 10}
    idesc('orange') // Clustering key
        = {listid: 1, itemid: 1000, idesc: orange, qty: 6}

On the fourth insert it is going to go through the data using each clustering key (or index in this example) to find the final piece of data to overwrite. So after the fourth insert it will look like this:

table1 =
(listid(1) - itemid(1000)) // Partition key
    idesc('apple') // Clustering key
        = {listid: 1, itemid: 1000, idesc: apple, qty: 5}
    idesc('banana') // Clustering key
        = {listid: 1, itemid: 1000, idesc: banana, qty: 10}
    idesc('orange') // Clustering key
        = {listid: 1, itemid: 1000, idesc: orange, qty: 50}

Missing partition/clustering key

Take the following queries and my examples of accessing the data like Cassandra would.

WHERE listid IN (1, 2) and itemid = 1000

result = (data[1-1000], data[2-1000])
WHERE listid = 1 AND itemid = 1000 AND idesc = 'apple'

result = data[1-1000]['apple']
WHERE idesc = 'apple'

result = data[????]['apple']

C* won't know which index to search for apple.

It is important to note that this is the same when inserting or updating data. Let's use your UPDATE query as an example here.

UPDATE mylistofitems SET quantity = 100 WHERE listid = 1 AND itemid = 1000;

With this query you are trying to do this:

`data[1-1000][????] = {listid: 1, itemid: 1000, idesc:????, qty: 1000}`

C* doesn't know which index to store the data in.

You should update your query to the following:

UPDATE mylistofitems SET quantity = 100 WHERE listid = 1 AND itemid = 1000 AND idesc = 'orange';

In array form this would look like:

`data[1-1000]['orange'] = {listid: 1, itemid: 1000, idesc: 'orange', qty: 1000}`

Adding quantity as a clustering key

If you add quantity as a clustering key the data structure would look like this:

table1 =
(listid(1) - itemid(1000)) // Partition key
    idesc('apple') // Clustering key
        quantity(5) // Clustering key
            = {listid: 1, itemid: 1000, idesc: 'apple', qty: 5}
    idesc('banana') // Clustering key
        quantity(10) // Clustering key
            = {listid: 1, itemid: 1000, idesc: 'banana', qty: 10}
    idesc('orange') // Clustering key
        quantity(6) // Clustering key
            = {listid: 1, itemid: 1000, idesc: 'orange', qty: 6}
        quantity(50 // Clustering key
            = {listid: 1, itemid: 1000, idesc: 'orange', qty: 50}

This would allow you to have multiple rows for each combination, although you would not be able to have multiple rows with the same data.

General rules

Partition + clustering keys are the unique fields for each row
You cannot query by a key without including the previous keys in the query
Cassandra does not have inserts/update - only upserts
When inserting a row, you must specify all keys