3 votes

We come from an RDBMS background, and we are trying to port an existing data store to Cassandra to leverage the power of a distributed database. Our requirement is to store values against a key (the key will probably be a timestamp; we plan to use epoch time) and to retrieve the values in a given key range.

As a test, we created a column family and inserted data using CQL (via cqlsh):

CREATE COLUMNFAMILY Log( KEY int PRIMARY KEY,Val1 varchar,Val2 varchar);

INSERT INTO Log (KEY,val1, val2) VALUES (1,'673153106.00','448768737.33'); 
INSERT INTO Log (KEY,val1, val2) VALUES (2,'673153106.50','448768737.67'); 
INSERT INTO Log (KEY,val1, val2) VALUES (3,'673153107.00','448768738.00'); 
INSERT INTO Log (KEY,val1, val2) VALUES (4,'673153107.50','448768738.33'); 
INSERT INTO Log (KEY,val1, val2) VALUES (5,'673153108.00','448768738.67'); 
INSERT INTO Log (KEY,val1, val2) VALUES (6,'673153108.50','448768739.00'); 
INSERT INTO Log (KEY,val1, val2) VALUES (7,'673153109.00','448768739.33'); 
INSERT INTO Log (KEY,val1, val2) VALUES (8,'673153109.50','448768739.67'); 
INSERT INTO Log (KEY,val1, val2) VALUES (9,'673153110.00','448768740.00'); 
INSERT INTO Log (KEY,val1, val2) VALUES (10,'673153110.50','448768740.33');

But our SELECT fails to return the correct data:

select * from Log where KEY>4 and KEY<9;

 KEY | val1         | val2
-----+--------------+--------------
  10 | 673153110.50 | 448768740.33
   8 | 673153109.50 | 448768739.67

Rerunning the same query, we instead get an error:

select * from Log where KEY>4 and KEY<9;

Bad Request: Start key's md5 sorts after end key's md5. This is not allowed; you probably should not specify end key at all, under RandomPartitioner

Are we doing something wrong? Is there any way to select values in a key range when using the RandomPartitioner?


1 Answer

14 votes

There is a good reason why Cassandra forbids this kind of query. Currently, all your Log entries are distributed evenly across your nodes using the md5 sum of the primary key. Supporting your query would mean that Cassandra has to query all nodes, retrieve all entries, store them on disk, and sort them, and it would have to do that every time you execute the query.
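You can see this ordering for yourself by asking Cassandra for the token it computes per key. This is a minimal sketch assuming a Cassandra version whose CQL supports the token() function; the exact token values depend on your cluster:

SELECT token(KEY), KEY, val1 FROM Log;

-- Rows come back ordered by token(KEY), the md5-based value,
-- not by KEY itself, which is why 10 sorted before 8 above.

A range condition of the form WHERE token(KEY) > token(4) is the supported way to scan under RandomPartitioner, but note that it ranges over the md5 values, not over your keys, so it is only useful for paging over all rows.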

You could use an order-preserving partitioner if you want to run this query, but that is not recommended either: if you insert your data sequentially, all your writes will hit a single node, causing unwanted hot spots.

The usual solution is to use a compound primary key (e.g. index_name + timeuuid). This ensures that your index is spread evenly across your cluster using the md5 sum of index_name, yet accessing the index (e.g. SELECT * FROM log WHERE index_name = ? AND time >= ? AND time < ?) is still efficient, because the data is already stored in sorted order on the node responsible for md5sum(index_name). index_name is usually some key that helps you partition your data; a user_id or application_id might be a good candidate.
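As a minimal sketch of that schema, assuming a Cassandra version with CQL 3 compound keys and the minTimeuuid() function (the table layout and the 'app42' partition key are illustrative, not from the question):

CREATE TABLE log (
    index_name text,
    time timeuuid,
    val1 text,
    val2 text,
    PRIMARY KEY (index_name, time)  -- index_name: partition key; time: clustering column
);

-- All rows for one index_name live together, sorted by time,
-- so this range query is a single sequential read on one replica set:
SELECT * FROM log
WHERE index_name = 'app42'
  AND time >= minTimeuuid('2012-07-01')
  AND time < minTimeuuid('2012-08-01');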

If you think that the index for a single index_name might be too large for a single node, you can adapt the previous schema by adding the current year and month to the index_name. Please read the following two articles for further information:
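A sketch of that bucketing, again with illustrative names; each (id, year, month) combination becomes its own partition:

-- Fold the bucket into the partition key at write time:
INSERT INTO log (index_name, time, val1, val2)
VALUES ('app42-2012-07', now(), '673153106.00', '448768737.33');

-- Reads must name the bucket; queries spanning months are issued
-- once per bucket and merged client-side:
SELECT * FROM log
WHERE index_name = 'app42-2012-07'
  AND time >= minTimeuuid('2012-07-15')
  AND time < minTimeuuid('2012-07-16');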