Ordered model in Cassandra 2.0

Question

I'm pretty new in Cassandra and I am trying to make a model for time series data. My current proposal is this:

CREATE TABLE myproject.variables (
nearest_10_minutes timestamp,
variable_type text,
value double,
variable_timestamp timestamp, 
PRIMARY KEY((variable_type, nearest_10_minutes), variable_timestamp)
)
WITH CLUSTERING ORDER BY (variable_timestamp ASC);

The variable_timestamp is the actual time when the value is sensed. The nearest_10_minutes is the timestamp, but rounded to the nearest 10 minutes. For example if the variable_timestamp is: 19/11/2013 13:13:19.562, the nearest_10_minutes is 19/11/2013 13:10:00.000

I can get rid of the variable_type in the cluster key and put it into a secondary index, but I'm not sure does that aid my case.

The issue is that I'm not really sure how to properly order data. If I take a select * from myproject.variables (just for testing purposes), I get something like (timestamps showed only):

Tue Nov 19 13:19:52 CET 2013
Tue Nov 19 13:19:55 CET 2013
Tue Nov 19 13:40:04 CET 2013
Tue Nov 19 13:40:14 CET 2013
Tue Nov 19 13:40:29 CET 2013
...
Tue Nov 19 13:49:58 CET 2013
Tue Nov 19 13:49:59 CET 2013
...
Tue Nov 19 14:30:00 CET 2013
Tue Nov 19 14:30:01 CET 2013

Now, I'm not really clear should I get some default ordering or not? If I use a select * from myproject.variables order by variable_timestamp asc then I get an error stating I may only use ORDER BY if the partition key is filtered with EQ or IN. And IN can only be used with the second portion of the partition key, not the first one.

All in all, I'm a bit confused, how can I model this in a way I can select and order my data?

------------------------------Answer:------------------------------------

So in a way all the current answers by jorgebg and BryceAtNetwork23 and comments from Mikhail Stepura have shown me the path I consider right. Since I wanted to keep the partitioning as random as possible but in a way that I can predict it so I can have ordered queries and use the IN keyword (so I can put multiple partition keys in the query), I just decided to make a custom partitioning key. The scheme I chose is concatenating variable_type:timestamp_rounded_by_the_hour. That introduces some client leaking of the storage logic, I know, but it is fairly easy to recreate the set of partitioning keys on query in the code.

The answer I chose was the one that contributed the most.

Also, you can have composite partition keys in CQL (without the need to contatenate them) thelastpickle.com/blog/2013/01/11/primary-keys-in-cql.html — jorgebg
True, but then I cannot use the IN keyword with all of the parts of the partition key except the last. — Aleksandar Stojadinovic

jorgebg jorgebg · Accepted Answer · 2014-03-05T14:18:29

The clustering order affects how the rows are stored (and retrieved) within the same partition key, not how all the rows inside the column family are stored (partitioned).

In your case, the rows retrieved from the following query will be ordered by variable_timestamp:

SELECT * FROM variables WHERE variable_type = ? AND nearest_10_minutes = ?;

But if you retrieve multiple partition keys:

SELECT * FROM variables;

It will be ordered by the partition key (the generated token of the partition key) and then by the clustering key.

You can see how rows and columns are stored and partitioned in the blog post CQL3 for Cassandra experts.

Ordered model in Cassandra 2.0

2 Answers