8
votes

EDIT1: added a case to describe the problem after the original question.

I wish to query on a column which is not part of my key. If I understand correctly, I need to define a secondary index on that column. However, I wish to use a greater than condition (not just equality condition) and that still seems unsupported.

Am I missing something? How would you address this issue?

My desired Setup:

Cassandra 1.1.6
CQL3

CREATE TABLE Table1(
             KeyA int,
             KeyB int,
             ValueA int,
             PRIMARY KEY (KeyA, KeyB)
           );

CREATE INDEX ON Table1 (ValueA);

SELECT * FROM Table1 WHERE ValueA > 3000;

Since defining a secondary index on ColumnFamilies with Composite Keys is still not supported in Cassandra 1.1.6 I have to settle on a temporary solution of dropping one of the keys but I still have the same problem with non equality conditions.

Is there another way to address this?

Thank you for your time.

Relevant sources: http://cassandra.apache.org/doc/cql3/CQL.html#selectStmt http://www.datastax.com/docs/1.1/ddl/indexes


EDIT1

Here's a case that will explain the problem. As rs-atl noted, it might be a data model problem. Let's say I keep a column family of all the users on stackoverflow. for each user I keep a batch of stats (Reputation, NumOfAnswers, NumOfVotes... all of them are int). I want to query on those stats to get the relevant users.

CREATE TABLE UserStats(
             UserID int,
             Reputation int,
             NumOfAnswers int,
             .
             .
             .
             A lot of stats...
             .
             .
             .
             NumOfVotes int,
             PRIMARY KEY (UserID)
           );

Now I'm interested in slicing UserID's based on those stats. I want all the users with over 10K reputation, I want all the users with less than 5 answers, etc. etc.

I hope that helps. Thanks again.

3
You are correct that you must have an equality condition. There are numerous posts that have been made discussing the reasons for this. I'd be happy to help with a correct data model if you'd post the actual problem. As has been said many times before but bears repeating, with Cassandra you must model your data to answer your queries. - rs_atl
First of all, thank you for agreeing to help. I'll edit my post and describe a simple case that fits the problem. If you are interested in my full model design you can find it here: stackoverflow.com/questions/13131254/… The relevant column family for this question is TestsData. - Oren
Do you know in advance what columns you'll need and/or what the range of values may be, or are these dynamic? - rs_atl
@rs_atl: I have about 20 or so stats. I'll have to be able to slice users on all of them. (BTW: when you said dynamic, did you mean the columns? I don't expect many new stats, this column family is quite static) - Oren

3 Answers

10
votes

In CQL, you are able to apply the WHERE clause on all columns once you have created indices for them (i.e., secondary index). Otherwise, you will get the following error:

Bad Request: No indexed columns present in by-columns clause with Equal operator

Unfortunately, even with secondary indices, the WHERE clause are required to have at least one EQ on an secondary index by CQL due to performance issue.

Q: Why is it necessary to always have at least one EQ comparison on secondary indices?

A: Inequalities on secondary indices are always done in memory, so without at least one EQ on another secondary index you will be loading every row in the database, which with a massive database isn't a good idea. So by requiring at least one EQ on an (secondary) index, you hopefully limit the set of rows that need to be read into memory to a manageable size. (Although obviously you can still get into trouble with that as well).

So basically if you have anything besides an EQ comparison, it loads all rows "that elsewise match" your query, and checks if they match, one at a time. Which is not allowed by default since it "could be slow." (In essence, indexes only index "for equality" not for anything else like < and > which indexes on a relational database would).

One thing to note is that if you have more than one non EQ conditions on secondary indices, you also need to include the ALLOW FILTERING key word in your query, or else you'll get

Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING

One simple way to work-around is to append a dummy column to your table where all row have the same value on that column. So in this case you are able to perform ranged query on just your desired column. Do realize that these kind of queries on a NoSQL database may be slow/bog down a system.


Example

cqlsh:demo> desc table table1;

CREATE TABLE table1 (
  keya int,
  keyb int,
  dummyvalue int,
  valuea int,
  PRIMARY KEY (keya, keyb)
) ....

cqlsh:demo> select * from Table1;

 keya | keyb | dummyvalue | valuea
------+------+------------+--------
    1 |    2 |          0 |      3
    4 |    5 |          0 |      6
    7 |    8 |          0 |      9

Create secondary indices on ValueA and DummyValue:

cqlsh:demo> create index table1_valuea on table1 (valuea);
cqlsh:demo> create index table1_valueb on table1 (dummyvalue);

Perform ranged query on ValueA with DummyValue=0:

cqlsh:demo> select * from table1 where dummyvalue = 0 and valuea > 5 allow filtering;

 keya | keyb | dummyvalue | valuea
------+------+------------+--------
    4 |    5 |          0 |      6
    7 |    8 |          0 |      9
1
votes

Probably the most flexible way to deal with this scenario in Cassandra will be to have a separate CF for each stat, with sentinel values as keys and the stat value in the column name, like this:

CF: StatName {
  Key: SomeSentinelValue {
    [Value]:[UserID] = ""
  }
}

So let's say your stat is NumAnswers and your user IDs are strings:

CF: NumAnswers {
  Key: 0 {
    150:Joe = ""
    200:Bob = ""
    500:Sue = ""
  }
  Key: 1000 {
    1020:George = ""
    1300:Ringo = ""
    1300:Mary = ""
  }
}

So you can see that your keys are essentially buckets of values, which can be as coarse or fine grain as needed for your data, and your columns are composites of value + user ID. You can now hand Cassandra a known key (or set of keys) for the coarse range you need (the equality), then do a range query on the first component of the column name. Note that you cannot write the user ID as value, because this would prevent two users from having the same count.

0
votes

PRIMARY KEY (KeyA, KeyB) );

CREATE INDEX ON Table1 (ValueA);

SELECT * FROM Table1 WHERE ValueA > 3000;

The Cassandra way is to have some partition key and always use that, with a clustering column for ValueA possibly PRIMARY KEY ((KeyA, KeyB), ValueA) and then use like:

select * from Table1 where KeyA='xx' and ValueA > 3000