2
votes

Let's say I have a primary key like this: primary key (PK, CK).

Based on what I read (see refs), I think I can loosely describe the way Cassandra uses PK (the partition key) and CK (the clustering key) as follows: PK decides which node(s) the data goes to, and CK is used for clustering (i.e., ordering) the data within that node.

Then it seems PK is not used to cluster data within the node, and that sounds wrong. What if I have a simple primary key with just PK? Will Cassandra only distribute data across nodes and not order data within each node, since there is no clustering column?

refs:


2 Answers

2
votes

Then it seems PK is not used to cluster data within the node, and that sounds wrong. What if I have a simple primary key with just PK? Will Cassandra only distribute data across nodes and not order data within each node, since there is no clustering column?

Good question. Let's try this out. I'll create a simple table and INSERT some data:

aploetz@cqlsh:stackoverflow> CREATE TABLE programs 
                             (name text PRIMARY KEY, data text);
aploetz@cqlsh:stackoverflow> INSERT INTO programs (name) VALUES ('Tron');
aploetz@cqlsh:stackoverflow> INSERT INTO programs (name) VALUES ('Yori');
aploetz@cqlsh:stackoverflow> INSERT INTO programs (name) VALUES ('Quorra');
aploetz@cqlsh:stackoverflow> INSERT INTO programs (name) VALUES ('Clu');
aploetz@cqlsh:stackoverflow> INSERT INTO programs (name) VALUES ('Flynn');
aploetz@cqlsh:stackoverflow> INSERT INTO programs (name) VALUES ('Zuze');

Now, let's run a query that should answer your question:

aploetz@cqlsh:stackoverflow> SELECT name, token(name) FROM programs;

 name   | system.token(name)
--------+----------------------
  Flynn | -1059892732813900311
   Zuze |  1815531347795840810
   Yori |  2854211700591734382
 Quorra |  3079126743186967718
   Tron |  6359222509420865788
    Clu |  8304850648940574176

(6 rows)

As you can see, they are definitely not in order by name, which is the partition key and lone PRIMARY KEY. But, my query runs the token() function on name, which shows the hashed value of the partition key (name in this case). The results are ordered by that.

So to answer your question, Cassandra orders its partitions by the hashed value (the token) of the partition key. Note that this order is maintained throughout the cluster, not just on a single node. Therefore, results for an unbounded query (one with no WHERE clause, which is not recommended in a multi-node configuration) will be ordered by the hashed value of the partition key, regardless of the number of nodes in the cluster.
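To make the "regardless of the number of nodes" point concrete, here is a rough Python sketch. It does not talk to Cassandra; it reuses the token values from the cqlsh output above and simulates a simplified ring. The evenly spaced node tokens and the `ring` and `full_scan` helpers are assumptions made for illustration, and vnodes and replication are ignored.

```python
from bisect import bisect_left

# Tokens copied from the cqlsh output above (name -> token(name)).
TOKENS = {
    "Tron": 6359222509420865788,
    "Yori": 2854211700591734382,
    "Quorra": 3079126743186967718,
    "Clu": 8304850648940574176,
    "Flynn": -1059892732813900311,
    "Zuze": 1815531347795840810,
}

# Token range of a signed 64-bit integer.
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def ring(node_count):
    """Hypothetical ring: evenly spaced node tokens, last node ends at MAX_TOKEN.

    Each node owns the tokens greater than the previous node's token,
    up to and including its own token.
    """
    step = 2**64 // node_count
    return [MIN_TOKEN + step * (i + 1) for i in range(node_count - 1)] + [MAX_TOKEN]


def full_scan(node_count):
    """Walk the ring node by node; each node returns its rows in token order."""
    node_tokens = ring(node_count)
    per_node = [[] for _ in node_tokens]
    for name, token in TOKENS.items():
        # The first node whose token is >= the row's token owns the row.
        per_node[bisect_left(node_tokens, token)].append((token, name))
    return [name for rows in per_node for _, name in sorted(rows)]


for n in (1, 3, 6):
    print(n, full_scan(n))
```

Under these assumptions, every node count produces the same sequence as the SELECT above, because each node holds a contiguous slice of the token range and the slices are visited in ring order.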

1
votes

All data for a table is written to that table's SSTables, ordered by partition key (more precisely, by the partition key's token). So yes, the partitions are sorted.

I think what you're asking is why you can't use a partition key the same way you use a clustering key. For example, you can't do less than (<) or greater than (>) on a partition key. Since no single node has all the partition keys, and partitions are ordered by token rather than by key value, this type of query would have to check with all nodes in your cluster to see if they have any partition key that matches your query.
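As a rough illustration of why a range on the key value is awkward while a range on the token is not, the sketch below reuses the tokens from the first answer's output and a made-up three-node ring with evenly spaced tokens. The `owner`, `owners_of_name_range`, and `names_in_token_range` helpers are hypothetical and only simulate the idea. In CQL, the token-based form corresponds to filtering on token(name) in the WHERE clause.

```python
from bisect import bisect_left

# Tokens copied from the cqlsh output in the first answer.
TOKENS = {
    "Tron": 6359222509420865788,
    "Yori": 2854211700591734382,
    "Quorra": 3079126743186967718,
    "Clu": 8304850648940574176,
    "Flynn": -1059892732813900311,
    "Zuze": 1815531347795840810,
}

MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def owner(token, node_count=3):
    """Index of the node owning a token in a hypothetical evenly spaced ring."""
    step = 2**64 // node_count
    node_tokens = [MIN_TOKEN + step * (i + 1) for i in range(node_count - 1)] + [MAX_TOKEN]
    return bisect_left(node_tokens, token)


def owners_of_name_range(lower):
    """Rows matching name > lower, with the node that holds each one.

    Matching names are scattered around the ring, so a coordinator cannot
    know in advance which nodes to ask.
    """
    return {name: owner(tok) for name, tok in TOKENS.items() if name > lower}


def names_in_token_range(lo, hi):
    """Rows with lo < token <= hi, in token order: a contiguous ring slice."""
    by_token = sorted(TOKENS.items(), key=lambda kv: kv[1])
    return [name for name, tok in by_token if lo < tok <= hi]


print(owners_of_name_range("T"))
print(names_in_token_range(0, MAX_TOKEN))
```

The names greater than "T" end up on different simulated nodes with unrelated tokens, whereas the token range maps to one contiguous stretch of the ring.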