2
votes

Using CQL3, how does one enumerate all the partition keys of a table in Cassandra? In particular there are complications with returning distinct keys, and paginating the results.

3
Check this blog post by Richard; it goes into great detail on why it's hard to count keys in a distributed system. - Lyuben Todorov
@LyubenTodorov: I'm aware of the difficulties. However I'm just after the keys, I don't care for count or consistency. - Matt Joiner

3 Answers

0
votes

With a little pre-knowledge about the possible values of your keys, I think this could be done with the help of the token function. Take a look at this answer. Is that what you are looking for?

Also, native pagination seems to be an upcoming feature for 2.0. It's in the latest beta.

Until 2.0 arrives, you can see this work-around for pagination on the datastax blog (go to the "CQL3 pagination" section). This is, in principle, much the same as the link I posted above, but it goes into great detail on how to implement pagination, taking column keys into account etc.

5
votes

You can do it as in the following example. Create a test table:

> create table partition_keys_test (p_key text PRIMARY KEY, rest text);

and insert some rows e.g.:

> insert into partition_keys_test (p_key, rest) VALUES ('1', 'blah');

I did this for p_key '1', '2', ..., '9'.

Then page through the partition keys. Start with:

> select p_key from partition_keys_test limit 2;
 p_key
 -------
     6
     7

for page size 2. Then, take your last p_key result and use it in the next query:

> select p_key from partition_keys_test where token(p_key) > token('7') limit 2;
 p_key
 -------
    9
    4

and so on, until you receive fewer results than your page size.
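The paging procedure above can be sketched as a small client-side loop. This is only a sketch under stated assumptions: `fetch_page` is a hypothetical callback (not part of any driver) that you would implement with whatever client you use, running the first query when `last_key` is `None` and the `token(p_key) > token(...)` query otherwise.

```python
from typing import Callable, Iterator, List, Optional

# Hypothetical callback type: (last_key, limit) -> list of p_key values,
# in the order Cassandra returned them. It is expected to run
#   select p_key from partition_keys_test limit <limit>
# when last_key is None, and
#   select p_key from partition_keys_test
#     where token(p_key) > token(<last_key>) limit <limit>
# otherwise.
FetchPage = Callable[[Optional[str], int], List[str]]


def iter_partition_keys(fetch_page: FetchPage, page_size: int = 2) -> Iterator[str]:
    """Yield partition keys page by page, following the token of the last key seen."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    last_key: Optional[str] = None
    while True:
        page = fetch_page(last_key, page_size)
        yield from page
        # A short (or empty) page means the token range is exhausted.
        if len(page) < page_size:
            return
        last_key = page[-1]
```

If the number of keys is an exact multiple of the page size, the loop issues one extra query that returns an empty page before stopping.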

Note that you should expect this to read through your entire data set. For very wide rows it may not, but it will still be very I/O heavy.

Also, rows that are created or removed with tokens higher than the point you have reached so far will be reflected in subsequent queries. So if you run the above paging queries while rows are being created or removed, those partition keys may or may not be returned, depending on timing.

4
votes

The bad news is that for now (August 2013) you have to select the entire primary key, not just the partition key, to paginate through them. With a compound PK, this may involve a lot of duplicate partition keys.
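Until a server-side option exists, the duplicates can be collapsed on the client. The following is a minimal sketch, assuming rows arrive as `(partition_key, clustering_key)` tuples and that all rows of one partition are returned contiguously; the function name and the tuple layout are illustrative, not part of any driver API.

```python
from itertools import groupby
from typing import Iterable, Iterator, Tuple


def distinct_partition_keys(rows: Iterable[Tuple[str, str]]) -> Iterator[str]:
    """Yield each partition key once from a stream of full primary keys.

    Assumes rows of the same partition are adjacent in the stream, so only
    consecutive duplicates need to be collapsed (no set of seen keys is kept).
    """
    for partition_key, _rows_in_partition in groupby(rows, key=lambda row: row[0]):
        yield partition_key
```

When paginating, a partition can span two pages, so the pages should be chained into one stream (for example with `itertools.chain.from_iterable`) before passing them in; otherwise the key at a page boundary would be reported twice.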

The good news is that https://issues.apache.org/jira/browse/CASSANDRA-4536 is open to allow SELECT DISTINCT for the special case of partition keys in 2.0.1, since it's possible to retrieve the unique partition keys efficiently under the hood; CQL just doesn't have a good way to express that until then.