9
votes

So, my original problem was using the token() function to page through a large data set in Cassandra 1.2.9, as explained and answered here: Paging large resultsets in Cassandra with CQL3 with varchar keys

The accepted answer got the select working with tokens and chunk size, but another problem manifested itself.

My table looks like this in cqlsh:

key           | column1               | value
---------------+-----------------------+-------
  85.166.4.140 |       county_finnmark |     4
  85.166.4.140 |       county_id_20020 |     4
  85.166.4.140 |     municipality_alta |     2
  85.166.4.140 | municipality_id_20441 |     2
 93.89.124.241 |        county_hedmark |    24
 93.89.124.241 |       county_id_20005 |    24

The primary key is a composite of key and column1. In CLI, the same data looks like this:

get ip['85.166.4.140'];
=> (counter=county_finnmark, value=4)
=> (counter=county_id_20020, value=4)
=> (counter=municipality_alta, value=2)
=> (counter=municipality_id_20441, value=2)
Returned 4 results.

The problem

When using cql with a limit of i.e. 100, the returned results may stop in the middle of a record, like this:

key           | column1               | value
---------------+-----------------------+-------
  85.166.4.140 |       county_finnmark |     4
  85.166.4.140 |       county_id_20020 |     4

leaving these to "rows" (columns) out:

  85.166.4.140 |     municipality_alta |     2
  85.166.4.140 | municipality_id_20441 |     2

Now, when I use the token() function for the next page like, these two rows are skipped:

select * from ip where token(key) > token('85.166.4.140') limit 10;

Result:

key           | column1                | value
---------------+------------------------+-------
 93.89.124.241 |         county_hedmark |    24
 93.89.124.241 |        county_id_20005 |    24
 95.169.53.204 |        county_id_20006 |     2
 95.169.53.204 |         county_oppland |     2

So, no trace of the last two results from the previous IP address.

Question

How can I use token() for paging without skipping over cql rows? Something like:

select * from ip where token(key) > token(key:column1) limit 10;
1

1 Answers

6
votes

Ok, so I used the info in this post to work out a solution: http://www.datastax.com/dev/blog/cql3-table-support-in-hadoop-pig-and-hive (section "CQL3 pagination").

First, I execute this cql:

select * from ip limit 5000;

From the last row in the resultset, I get the key (i.e. '85.166.4.140') and the value from column1 (i.e. 'county_id_20020').

Then I create a prepared statement evaluating to

select * from ip where token(key) = token('85.166.4.140') and column1 > 'county_id_20020' ALLOW FILTERING;

(I'm guessing it would work also without using the token() function, as the check is now for equal:)

select * from ip where key = '85.166.4.140' and column1 > 'county_id_20020' ALLOW FILTERING;

The resultset now contains the remaining X rows (columns) for this IP. The method then returns all the rows, and the next call to the method includes the last used key ('85.166.4.140'). With this key, I can execute the following select:

select * from ip where token(key) > token('85.166.4.140') limit 5000;

which gives me the next 5000 rows from (and including) the first IP after '85.166.4.140'.

Now, no columns are lost in the paging.

UPDATE

Cassandra 2.0 introduced automatic paging, handled by the client. More info here: http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0

(note that setFetchSize is optional and not necessary for paging to work)