2
votes

I'm pretty new in Cassandra and I am trying to make a model for time series data. My current proposal is this:

CREATE TABLE myproject.variables (
nearest_10_minutes timestamp,
variable_type text,
value double,
variable_timestamp timestamp, 
PRIMARY KEY((variable_type, nearest_10_minutes), variable_timestamp)
)
WITH CLUSTERING ORDER BY (variable_timestamp ASC);

The variable_timestamp is the actual time when the value is sensed. The nearest_10_minutes is the timestamp, but rounded to the nearest 10 minutes. For example if the variable_timestamp is: 19/11/2013 13:13:19.562, the nearest_10_minutes is 19/11/2013 13:10:00.000

I can get rid of the variable_type in the cluster key and put it into a secondary index, but I'm not sure does that aid my case.

The issue is that I'm not really sure how to properly order data. If I take a select * from myproject.variables (just for testing purposes), I get something like (timestamps showed only):

Tue Nov 19 13:19:52 CET 2013
Tue Nov 19 13:19:55 CET 2013
Tue Nov 19 13:40:04 CET 2013
Tue Nov 19 13:40:14 CET 2013
Tue Nov 19 13:40:29 CET 2013
...
Tue Nov 19 13:49:58 CET 2013
Tue Nov 19 13:49:59 CET 2013
...
Tue Nov 19 14:30:00 CET 2013
Tue Nov 19 14:30:01 CET 2013

Now, I'm not really clear should I get some default ordering or not? If I use a select * from myproject.variables order by variable_timestamp asc then I get an error stating I may only use ORDER BY if the partition key is filtered with EQ or IN. And IN can only be used with the second portion of the partition key, not the first one.

All in all, I'm a bit confused, how can I model this in a way I can select and order my data?

------------------------------Answer:------------------------------------

So in a way all the current answers by jorgebg and BryceAtNetwork23 and comments from Mikhail Stepura have shown me the path I consider right. Since I wanted to keep the partitioning as random as possible but in a way that I can predict it so I can have ordered queries and use the IN keyword (so I can put multiple partition keys in the query), I just decided to make a custom partitioning key. The scheme I chose is concatenating variable_type:timestamp_rounded_by_the_hour. That introduces some client leaking of the storage logic, I know, but it is fairly easy to recreate the set of partitioning keys on query in the code.

The answer I chose was the one that contributed the most.

2
Also, you can have composite partition keys in CQL (without the need to contatenate them) thelastpickle.com/blog/2013/01/11/primary-keys-in-cql.html - jorgebg
True, but then I cannot use the IN keyword with all of the parts of the partition key except the last. - Aleksandar Stojadinovic

2 Answers

2
votes

The clustering order affects how the rows are stored (and retrieved) within the same partition key, not how all the rows inside the column family are stored (partitioned).

In your case, the rows retrieved from the following query will be ordered by variable_timestamp:

SELECT * FROM variables WHERE variable_type = ? AND nearest_10_minutes = ?;

But if you retrieve multiple partition keys:

SELECT * FROM variables;

It will be ordered by the partition key (the generated token of the partition key) and then by the clustering key.

You can see how rows and columns are stored and partitioned in the blog post CQL3 for Cassandra experts.

2
votes
PRIMARY KEY((variable_type, nearest_10_minutes), variable_timestamp)

The best values for a PRIMARY/Partitioning key are ones that are unique. How unique is variable_type? Just based on what you've said here (without really knowing your data), it sounds like value might make a better candidate.

WITH CLUSTERING ORDER BY (variable_timestamp ASC);

"Now, I'm not really clear should I get some default ordering or not?"

Unless I missed something, it looks like your data is coming back in ascending order. Check through DataStax's documentation on clustering order. With the way you have your PRIMARY KEY defined right now, it will enforce a clustering order on variable_timestamp, ascending. You should only need to specify CLUSTERING ORDER if you need that order reversed. Note that your clustering key(s) indicate the order in which the data is stored on disk.

"I get an error stating I may only use ORDER BY if the partition key is filtered with EQ or IN."

Right, because CQL won't just let you specify any columns in your WHERE or ORDER BY. If you want to do anything with the clustering key(s) (in the ORDER BY or WHERE), you need to also specify the partition key(s) (in the WHERE clause, in this case).

Try creating your PRIMARY KEY like this:

PRIMARY KEY(value, variable_timestamp)

Or if after evaluating your model variable_type still makes sense:

PRIMARY KEY(variable_type, variable_timestamp)

And don't add the CLUSTERING ORDER clause.

For more help on modeling time series data, check out DataStax Academy's (free) course called "Java Development With Apache Cassandra". Session 5, Module 104 discusses modeling a simple time series. See if that helps.