Do the concepts of wide rows, partitions, clustering columns/keys, and partition keys exist at Cassandra's querying language level?

Question

In Cassandra, do the concepts of wide rows, partitions, clustering columns/keys, and partition keys exist at the querying language level? Or are they internal implementation issues that users of the querying language are not aware of?

Here is an example from "How to understand the concept of wide row and related concepts in Cassandra?". In the query-language commands, the above concepts do not seem to appear, but under the hood they do.

Consider a table created with a as partition key and b as clustering column:
CREATE TABLE test (a text, b int, c text, PRIMARY KEY (a, b));
INSERT INTO test (a, b, c) VALUES ('test', 2, 'test2');
INSERT INTO test (a, b, c) VALUES ('test', 1, 'test1');
INSERT INTO test (a, b, c) VALUES ('test-new', 1, 'test1');
If you run the above statements in this order, Cassandra will store the data in the following order (just check the order of column b):
test -> [b:1,c=test1] [b:2,c=test2]
test-new -> [b:1,c=test1]
Pick up the cell with b:1 for partition key test:
SELECT * FROM test WHERE a = 'test' AND b = 1;
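
To make the storage order above concrete, here is a toy Python model of it. This is only a sketch and not how Cassandra is implemented: the `table` dict and the `insert` and `select` helpers are made up for illustration. Each partition key maps to a list of rows that is kept sorted by the clustering value b.

```python
from bisect import insort
from collections import defaultdict

# Toy model only: partition key -> rows kept sorted by clustering value b.
table = defaultdict(list)

def insert(a, b, c):
    """Mimic INSERT INTO test (a, b, c): upsert row (b, c) into partition a."""
    rows = table[a]
    # A row with the same primary key (a, b) is overwritten.
    rows[:] = [r for r in rows if r[0] != b]
    insort(rows, (b, c))

def select(a, b):
    """Mimic SELECT * FROM test WHERE a = ? AND b = ?"""
    return [(a, rb, rc) for rb, rc in table.get(a, []) if rb == b]

# Same statements, same order as in the example above.
insert("test", 2, "test2")
insert("test", 1, "test1")
insert("test-new", 1, "test1")

for key, rows in table.items():
    print(key, "->", rows)
print(select("test", 1))
```

Although b = 2 is inserted first, the rows of partition test come back ordered by b, which matches the layout shown above.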

Thanks.

The partition key and clustering key concepts do exist in CQL. A wide row is nothing but the result of choosing a bad partition key. — undefined_variable
If a clustering key is not defined, the ORDER BY clause will not work in CQL; ORDER BY only works on clustering columns. Similarly, a WHERE clause is most efficient when it uses the partition key. — undefined_variable
Thanks. Could you be more specific? (Maybe write an answer?) — Tim
@undefined_variable Thanks. In your example, if two rows have different values of their partition keys, is it correct that they belong to different partitions, and different partitions mean different nodes or data stores? — Tim
Yes, a different partition key means the data belongs to a different partition. However, one node is responsible for many partitions, so a different partition doesn't mean a different node. — undefined_variable

markc · Accepted Answer · 2019-11-29T17:56:18

CQL Schema

Based on your table schema as follows:

CREATE TABLE test (a text, b int, c text, PRIMARY KEY (a, b));

The primary key is made up of "a" and "b": "a" is the partition key and "b" is the clustering column. I think the following Stack Overflow post will address all your questions as to what partition keys etc. might be: Difference between partition key, composite key and clustering key in Cassandra?

Data files

Partitions, clustering columns etc. are all present at the data file level (and therefore in the DB). Internally, this is understood by Cassandra's storage engine. Using your example, I created the table, flushed the keyspace and inspected the sstable using sstabledump.

Note that you do have to run the tool as the same user that Cassandra is running as (in my case it is the cassandra user):

$ sudo -u cassandra sstabledump /var/lib/cassandra/data/mc/test-bedc4ba012cf11ea93f72f6848f9d70d/md-1-big-Data.db

[
  {
    "partition" : {
      "key" : [ "test" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 37,
        "clustering" : [ 1 ],
        "liveness_info" : { "tstamp" : "2019-11-29T17:43:35.752796Z" },
        "cells" : [
          { "name" : "c", "value" : "test1" }
        ]
      },
      {
        "type" : "row",
        "position" : 37,
        "clustering" : [ 2 ],
        "liveness_info" : { "tstamp" : "2019-11-29T17:43:31.144961Z" },
        "cells" : [
          { "name" : "c", "value" : "test2" }
        ]
      }
    ]
  },
  {
    "partition" : {
      "key" : [ "test-new" ],
      "position" : 54
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 95,
        "clustering" : [ 1 ],
        "liveness_info" : { "tstamp" : "2019-11-29T17:43:41.438779Z" },
        "cells" : [
          { "name" : "c", "value" : "test1" }
        ]
      }
    ]
  }
]

We can clearly see that the partition with key "test" has two rows, with clustering values 1 and 2 respectively.
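
If you want to pull the same information out programmatically, a small Python sketch like the following could do it. It assumes the JSON shape shown in the dump above; the embedded string is an abbreviated copy of that output (positions and timestamps left out), and `summarize` is just a helper name I made up.

```python
import json

# Abbreviated copy of the sstabledump output above
# (positions and timestamps omitted for brevity).
dump = json.loads("""
[
  {"partition": {"key": ["test"]},
   "rows": [
     {"type": "row", "clustering": [1],
      "cells": [{"name": "c", "value": "test1"}]},
     {"type": "row", "clustering": [2],
      "cells": [{"name": "c", "value": "test2"}]}
   ]},
  {"partition": {"key": ["test-new"]},
   "rows": [
     {"type": "row", "clustering": [1],
      "cells": [{"name": "c", "value": "test1"}]}
   ]}
]
""")

def summarize(partitions):
    """Map each partition key to the clustering values of its rows."""
    return {
        tuple(p["partition"]["key"]): [tuple(r["clustering"]) for r in p["rows"]]
        for p in partitions
    }

summary = summarize(dump)
for key, clusterings in summary.items():
    print(key, "->", clusterings)
```

Each top-level entry is one partition, and each element of its "rows" array is one row identified by its clustering value.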

For a bit more background information on the Storage engine see: https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlManageOndisk.html

Wide rows

This is not so much something you decide to use or implement; rather, it is a side effect of a bad data model. As an example, imagine you had a table like so:

CREATE TABLE mc.cars (
    owner_id int PRIMARY KEY,
    car_reg text,
    owner_name text,
    price float,
    purchased date
);

While this model might be ok, imagine you then had a (lucky!) owner who had over 1000 cars in their collection. Aside from a large garage, they might also be the cause of a wide row. If however your table looked something like this:

CREATE TABLE mc.cars2 (
    owner_id int,
    car_reg text,
    owner_name text,
    price float,
    purchased date,
    PRIMARY KEY (owner_id, car_reg)
) WITH CLUSTERING ORDER BY (car_reg ASC);

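Here each car becomes its own row, ordered by car_reg within the owner's partition. Note that with PRIMARY KEY (owner_id, car_reg) the partition key is still owner_id alone and car_reg is a clustering column, so all of one owner's cars stay in the same partition. To be less likely to see a wide row, car_reg would have to be part of the partition key too, as in PRIMARY KEY ((owner_id, car_reg)), in which case the CLUSTERING ORDER BY clause no longer applies.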
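
To see how the choice of partition key drives partition size, here is a toy Python count of rows per partition. It uses made-up data and is not Cassandra code; `partition_sizes` is a hypothetical helper. It compares grouping rows by owner_id alone with grouping them by owner_id and car_reg together.

```python
from collections import Counter

# Hypothetical data: owner 1 has 1000 cars, owner 2 has two.
cars = [(1, f"REG{i:04d}") for i in range(1000)]
cars += [(2, "ABC123"), (2, "XYZ789")]

def partition_sizes(rows, partition_key):
    """Count how many rows land in each partition for a given key function."""
    return Counter(partition_key(row) for row in rows)

# Partition key = owner_id only.
by_owner = partition_sizes(cars, lambda r: r[0])
# Partition key = (owner_id, car_reg) together.
by_owner_and_reg = partition_sizes(cars, lambda r: (r[0], r[1]))

print("largest partition keyed by owner_id:", max(by_owner.values()))
print("largest partition keyed by (owner_id, car_reg):",
      max(by_owner_and_reg.values()))
```

Keyed by owner_id alone, the lucky owner's partition holds all 1000 rows; keyed by both columns, every partition holds a single row, at the cost of no longer being able to read all of an owner's cars from one partition.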
