0
votes

Table-1:

create table mylistofitems (listid int, 
  itemid int, 
  quantity int, 
  itemdesc text, 
  primary key ((listid, itemid), itemdesc));

In the above table I'm doing the following inserts:

insert into mylistofitems (listid, itemid, itemdesc, quantity) values (1, 1000, 'apple', 5);
insert into mylistofitems (listid, itemid, itemdesc, quantity) values (1, 1000, 'banana', 10);
insert into mylistofitems (listid, itemid, itemdesc, quantity) values (1, 1000, 'orange', 6);
insert into mylistofitems (listid, itemid, itemdesc, quantity) values (1, 1000, 'orange', 50);

when I do select * from mylistofitems I get the following:

 listid | itemid | itemdesc | quantity
--------+--------+----------+----------
      1 |   1000 |    apple |        5
      1 |   1000 |   banana |       10
      1 |   1000 |   orange |       50

Second insert statement did NOT overwrite the 1st row. But the fourth insert statement has overwritten the third row.

In this context what is the meaning of clustering key?

Table-2:

create table myitems (listid int,
  itemid int, 
  idesc text, 
  qty int,
  primary key (listid, itemid));

I insert the following records into table-2:

insert into myitems (listid, itemid, idesc, qty) values (1, 1000,
'apple', 5);
insert into myitems (listid, itemid, idesc, qty) values (1, 1000, 'banana', 10);
insert into myitems (listid, itemid, idesc, qty) values (1, 1000, 'orange', 6);
insert into myitems (listid, itemid, idesc, qty) values (1, 1000, 'orange', 50);

Insert queries in table-2 are exactly the same as table-1. But when I do select * from myitems I'm surprised to see only one row which was inserted last. Rest all rows are lost. ie., Every insert statement overwrote the previous record.

 listid | itemid | idesc  | qty
--------+--------+--------+-----
      1 |   1000 | orange |  50

Question: Why is it behaving differently in table-2 compared to table-1? What is the meaning of clustering key in this context? Why clustering key has been named "Clustering key". Does it have anything to do with cassandra cluster?

Question on update: I tried doing an update on table-1:

update mylistofitems set quantity = 100 where listid = 1 and itemid = 1000;

This says error 2200 some clustering key are missing. Why this is restricted?

2

2 Answers

2
votes

What is a clustering key?

The clustering keys dictate how data is stored on disk. This is one of the reasons that Cassandra is so performant. Because the order of the columns change how the data is stored, it is important you know they are managed in Cassandra's internals.

Visualise the data on disk as an array. This is effectively how Cassandra will store your data. This is what the first table would look like after the first 3 queries:

table1 =
(listid(1) - itemid(1000)) // Partition key
    idesc('apple') // Clustering key
        = {listid: 1, itemid: 1000, idesc: apple, qty: 5}
    idesc('banana') // Clustering key
        = {listid: 1, itemid: 1000, idesc: banana, qty: 10}
    idesc('orange') // Clustering key
        = {listid: 1, itemid: 1000, idesc: orange, qty: 6}

On the fourth insert it is going to go through the data using each clustering key (or index in this example) to find the final piece of data to overwrite. So after the fourth insert it will look like this:

table1 =
(listid(1) - itemid(1000)) // Partition key
    idesc('apple') // Clustering key
        = {listid: 1, itemid: 1000, idesc: apple, qty: 5}
    idesc('banana') // Clustering key
        = {listid: 1, itemid: 1000, idesc: banana, qty: 10}
    idesc('orange') // Clustering key
        = {listid: 1, itemid: 1000, idesc: orange, qty: 50}

Missing partition/clustering key

Take the following queries and my examples of accessing the data like Cassandra would.

  1. WHERE listid IN (1, 2) and itemid = 1000

    result = (data[1-1000], data[2-1000])

  2. WHERE listid = 1 AND itemid = 1000 AND idesc = 'apple'

    result = data[1-1000]['apple']

  3. WHERE idesc = 'apple'

    result = data[????]['apple']

    C* won't know which index to search for apple.

It is important to note that this is the same when inserting or updating data. Let's use your UPDATE query as an example here.

UPDATE mylistofitems SET quantity = 100 WHERE listid = 1 AND itemid = 1000;

With this query you are trying to do this:

`data[1-1000][????] = {listid: 1, itemid: 1000, idesc:????, qty: 1000}`

C* doesn't know which index to store the data in.

You should update your query to the following:

UPDATE mylistofitems SET quantity = 100 WHERE listid = 1 AND itemid = 1000 AND idesc = 'orange';

In array form this would look like:

`data[1-1000]['orange'] = {listid: 1, itemid: 1000, idesc: 'orange', qty: 1000}`

Adding quantity as a clustering key

If you add quantity as a clustering key the data structure would look like this:

table1 =
(listid(1) - itemid(1000)) // Partition key
    idesc('apple') // Clustering key
        quantity(5) // Clustering key
            = {listid: 1, itemid: 1000, idesc: 'apple', qty: 5}
    idesc('banana') // Clustering key
        quantity(10) // Clustering key
            = {listid: 1, itemid: 1000, idesc: 'banana', qty: 10}
    idesc('orange') // Clustering key
        quantity(6) // Clustering key
            = {listid: 1, itemid: 1000, idesc: 'orange', qty: 6}
        quantity(50 // Clustering key
            = {listid: 1, itemid: 1000, idesc: 'orange', qty: 50}

This would allow you to have multiple rows for each combination, although you would not be able to have multiple rows with the same data.

General rules

  1. Partition + clustering keys are the unique fields for each row
  2. You cannot query by a key without including the previous keys in the query
  3. Cassandra does not have inserts/update - only upserts
  4. When inserting a row, you must specify all keys
1
votes

I would like to answer my own question to close this thread. Also to help others who would have same confusions:

Basically I was over looking the concept of primary key, partition key, clustering key.

Primary key of table-1 is:

primary key ((listid, itemid), itemdesc));

This means listid+itemid is just a partition key helping the record to find it's node to land on to.

The actual uniqueness will be maintained only for the combination of listid+itemid+itemdesc.

In summary:

listid+itemid = composite partition key
listid+itemid+itemdesc = composite primary key
itemdesc = clustering key 

(clustering key is used just to sort the items under each partition which helps querying by this column with various relational operators)

Clustering in this context is nothing but grouping the records by partition key and ordering them in ASC (by default) under each partition key. In other words it's a group by and order by.

This is quite different from RDBMS. In RDBMS world you use group by and order by at the time of retrieval as you want. In Cassandra we use grouping and ordering at the time of insertion itself so that your retrieval is faster (subject to the queries used)

Table-2 is defined as primary key (listid, itemid) which means:

listid = standalone partition key
itemid = standalone clustering key
listid + itemid = composite primary key