1
votes

In Cassandra, do the concepts of wide rows, partitions, clustering columns/keys, and partition keys exist at the querying language level? Or are they internal implementation issues that users of the querying language are not aware of?

Here is an example from "How to understand the concept of wide row and related concepts in Cassandra?". In the query language commands these concepts do not seem to appear, but under the hood they do.

Consider a table created with a as partition key and b as clustering column:

CREATE TABLE test (a text, b int, c text, PRIMARY KEY (a, b));
INSERT INTO test (a, b, c) VALUES ('test', 2, 'test2');
INSERT INTO test (a, b, c) VALUES ('test', 1, 'test1');
INSERT INTO test (a, b, c) VALUES ('test-new', 1, 'test1');

If you run the above statements in this order, Cassandra will store the data in the following order (just check the order of column b):

test -> [b:1,c=test1] [b:2,c=test2]
test-new -> [b:1,c=test1]

Pick up the cell with b:1 for partition key test:

SELECT * FROM test WHERE a = 'test' AND b = 1;
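
To make the layout above concrete, here is a minimal Python sketch of it. This is only a toy model under my own assumptions (the names `table`, `insert` and `select` are made up for illustration); it is not how Cassandra's storage engine is implemented. It only shows that `a` selects the partition and `b` orders the rows inside it.

```python
from bisect import insort

# Toy model only: partition key -> list of (clustering value, c),
# kept sorted by clustering value.
table = {}

def insert(a, b, c):
    """Mimic INSERT INTO test (a, b, c): 'a' picks the partition, 'b' orders the row in it."""
    rows = table.setdefault(a, [])
    # Same primary key (a, b) means overwrite, since a CQL INSERT is an upsert.
    rows[:] = [r for r in rows if r[0] != b]
    insort(rows, (b, c))

def select(a, b):
    """Mimic SELECT * FROM test WHERE a = ? AND b = ?."""
    return [(a, rb, rc) for rb, rc in table.get(a, []) if rb == b]

insert('test', 2, 'test2')
insert('test', 1, 'test1')
insert('test-new', 1, 'test1')

print(table['test'])      # rows are ordered by b, not by insertion order
print(select('test', 1))
```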

Thanks.

2
The partition key and clustering key concepts do exist in CQL. A wide row is nothing but a case of choosing a bad partition key. - undefined_variable
If a clustering key is not defined, then the ORDER BY clause will not work in CQL; ORDER BY only works on clustering columns. Similarly, a WHERE clause is most efficient when it uses the partition key. - undefined_variable
Thanks. Could you be more specific? (Maybe write an answer?) - Tim
@undefined_variable Thanks. In your example, if two rows have different values of their partition keys, is it correct that they belong to different partitions, and different partitions mean different nodes or data stores? - Tim
Yes, a different partition key means the data belongs to a different partition. However, one node is responsible for many partitions, so a different partition doesn't necessarily mean a different node. - undefined_variable

2 Answers

2
votes

CQL Schema

Based on your table schema as follows:

CREATE TABLE test (a text, b int, c text, PRIMARY KEY (a, b));

The primary key is made up of "a" and "b": "a" is the partition key and "b" is a clustering column. I think the following Stack Overflow post will address all your questions as to what partition keys etc. might be: Difference between partition key, composite key and clustering key in Cassandra?

Data files

Partitions, clustering columns etc. are all present at the data file level (and therefore at the database level). Internally this is understood by Cassandra's storage engine. Using your example, I created the table, flushed the keyspace and inspected the SSTable using sstabledump.

Note that you do have to run the tool as the same user that Cassandra is running as (in my case it is the cassandra user):

$ sudo -u cassandra sstabledump /var/lib/cassandra/data/mc/test-bedc4ba012cf11ea93f72f6848f9d70d/md-1-big-Data.db

[
  {
    "partition" : {
      "key" : [ "test" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 37,
        "clustering" : [ 1 ],
        "liveness_info" : { "tstamp" : "2019-11-29T17:43:35.752796Z" },
        "cells" : [
          { "name" : "c", "value" : "test1" }
        ]
      },
      {
        "type" : "row",
        "position" : 37,
        "clustering" : [ 2 ],
        "liveness_info" : { "tstamp" : "2019-11-29T17:43:31.144961Z" },
        "cells" : [
          { "name" : "c", "value" : "test2" }
        ]
      }
    ]
  },
  {
    "partition" : {
      "key" : [ "test-new" ],
      "position" : 54
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 95,
        "clustering" : [ 1 ],
        "liveness_info" : { "tstamp" : "2019-11-29T17:43:41.438779Z" },
        "cells" : [
          { "name" : "c", "value" : "test1" }
        ]
      }
    ]
  }
]

We can clearly see that the partition with key "test" has two rows, with clustering values 1 and 2 respectively.
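
If you want to check this programmatically rather than by eye, the JSON can be summarised with a few lines of Python. This is only a sketch: the embedded JSON is an abridged copy of the output above (only the fields used here), and `clustering_by_partition` is a helper name I made up.

```python
import json

# Abridged copy of the sstabledump output above (only the fields used here).
dump = json.loads("""
[
  {"partition": {"key": ["test"]},
   "rows": [{"type": "row", "clustering": [1]},
            {"type": "row", "clustering": [2]}]},
  {"partition": {"key": ["test-new"]},
   "rows": [{"type": "row", "clustering": [1]}]}
]
""")

def clustering_by_partition(partitions):
    """Map each partition key to the clustering values of its rows."""
    result = {}
    for p in partitions:
        key = tuple(p["partition"]["key"])
        result[key] = [tuple(r["clustering"])
                       for r in p["rows"] if r.get("type") == "row"]
    return result

layout = clustering_by_partition(dump)
for key, clusterings in layout.items():
    print(key, "->", clusterings)
```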

For a bit more background information on the Storage engine see: https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlManageOndisk.html

Wide rows

This is not so much something you decide to use or implement; rather, it is a side effect of a bad data model. As an example, imagine you had a table like so:

CREATE TABLE mc.cars (
    owner_id int PRIMARY KEY,
    car_reg text,
    owner_name text,
    price float,
    purchased date
);

While this model might be ok, imagine you then had a (lucky!) owner who had over 1000 cars in their collection. Aside from a large garage, they might also be the cause of a wide row. If however your table looked something like this:

CREATE TABLE mc.cars2 (
    owner_id int,
    car_reg text,
    owner_name text,
    price float,
    purchased date,
    PRIMARY KEY (owner_id, car_reg)
) WITH CLUSTERING ORDER BY (car_reg ASC);

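You will be less likely to see a wide row only if the car reg number is part of the partition key. As written, PRIMARY KEY (owner_id, car_reg) makes owner_id the partition key and car_reg a clustering column, so all of an owner's cars still share one partition. Declaring PRIMARY KEY ((owner_id, car_reg)) instead (and dropping the CLUSTERING ORDER BY clause) would give each car its own partition.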
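
As a rough illustration of how the choice of partition key columns determines partition width, here is a minimal Python sketch. It is a toy model, not Cassandra code: the owner id 42, the made-up registration numbers and the helper `partition_sizes` are all assumptions for the example. It simply groups one owner's 1000 cars by two candidate partition keys.

```python
from collections import Counter

# Hypothetical data: one owner (id 42) with 1000 cars; registrations are made up.
cars = [{"owner_id": 42, "car_reg": f"REG{i:04d}"} for i in range(1000)]

def partition_sizes(rows, partition_key_columns):
    """Count rows per partition for a given choice of partition key columns."""
    return Counter(tuple(row[c] for c in partition_key_columns) for row in rows)

# PRIMARY KEY (owner_id, car_reg): owner_id is the partition key,
# car_reg is a clustering column.
by_owner = partition_sizes(cars, ["owner_id"])

# PRIMARY KEY ((owner_id, car_reg)): both columns form a composite partition key.
by_owner_and_reg = partition_sizes(cars, ["owner_id", "car_reg"])

print(len(by_owner), max(by_owner.values()))
print(len(by_owner_and_reg), max(by_owner_and_reg.values()))
```

The trade-off is that with a composite partition key a query has to supply both owner_id and car_reg to address a partition, so fetching all cars of one owner is no longer a single-partition read.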

1
votes

Definitely - CQL syntax does have a notion of partition keys vs clustering keys. Just look at the example you provided:

CREATE TABLE test (a text, b int, c text, PRIMARY KEY (a, b));

The syntax (a,b) means, in CQL, that a is a partition key and b is a clustering key. As another example, if you were to write ((a,b,c),d,e,f) this would mean that a,b, and c are partition key columns, while d, e and f are clustering key columns. This is CQL syntax.
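
To spell the rule out, here is a small Python sketch that splits the inside of a PRIMARY KEY clause into partition key columns and clustering columns. `split_primary_key` is a hypothetical helper written for this answer, not part of any Cassandra driver, and it assumes plain column names without quoting.

```python
def split_primary_key(spec):
    """Split the inside of a CQL PRIMARY KEY(...) clause into
    (partition key columns, clustering columns).
    Hypothetical helper; handles simple, unquoted identifiers only."""
    spec = spec.strip()
    if spec.startswith("(") and spec.endswith(")"):
        spec = spec[1:-1].strip()          # drop the outer parentheses
    if spec.startswith("("):
        # Composite partition key, e.g. ((a, b, c), d, e, f)
        close = spec.index(")")
        partition = [c.strip() for c in spec[1:close].split(",")]
        rest = spec[close + 1:]
    else:
        # Single-column partition key, e.g. (a, b)
        first, _, rest = spec.partition(",")
        partition = [first.strip()]
    clustering = [c.strip() for c in rest.split(",") if c.strip()]
    return partition, clustering

print(split_primary_key("(a,b)"))
print(split_primary_key("((a,b,c),d,e,f)"))
```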

What this means in practice, I assume you know. Among other things, you can ask to get all the clustering rows belonging to a single partition in some known sort order - but partitions are not sorted by key: a full-table scan returns them in token (hash) order, which is effectively random with respect to the key values.
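
A rough Python sketch of that ordering, under stated assumptions: the `token` function below uses MD5 purely as a stand-in (Cassandra's default partitioner uses Murmur3), and the extra partition "another" is made-up sample data. Partitions come out in token order, while rows within each partition stay sorted by the clustering column.

```python
import hashlib

def token(partition_key):
    # Stand-in token function (MD5); Cassandra's default partitioner uses Murmur3.
    return int.from_bytes(hashlib.md5(partition_key.encode()).digest(), "big")

# (a, b) pairs: a is the partition key, b the clustering column.
rows = [("test", 2), ("test", 1), ("test-new", 1), ("another", 3), ("another", 1)]

# A full scan walks partitions in token order; within a partition,
# rows follow the clustering order.
scan = sorted(rows, key=lambda r: (token(r[0]), r[1]))
for a, b in scan:
    print(a, b)
```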

The term "wide row" is not used in CQL as a term, but the concept definitely exists, as I explained above. A "wide row" (actually, "wide partition" is more accurate) is what happens when a single partition has a lot of clustering rows, i.e., a lot of different clustering keys for the same partition key. Wide rows are supported decently in Cassandra, up to a limit: reading from really huge partitions can be slower, and various pieces of the code still handle them in an inefficient manner. Some documents suggest that Cassandra partitions should ideally be up to 10MB in size.