
Reading the DataStax blog, I was testing slice queries, even though the blog was written for Cassandra 2 and we are on Cassandra 3.

So I define a test table as below.

CREATE TABLE foo(
  part_key TEXT,
  start_ts INT,
  end_ts INT,
  PRIMARY KEY(part_key, start_ts, end_ts)
) WITH CLUSTERING ORDER BY (start_ts ASC, end_ts ASC);

Some fixtures are inserted in this table.

INSERT INTO foo(part_key, start_ts, end_ts) VALUES ('A', 1, 5);
INSERT INTO foo(part_key, start_ts, end_ts) VALUES ('A', 2, 3);
INSERT INTO foo(part_key, start_ts, end_ts) VALUES ('B', 4, 7);
INSERT INTO foo(part_key, start_ts, end_ts) VALUES ('B', 9, 13);

INSERT INTO foo(part_key, start_ts, end_ts) VALUES ('B', 1, 2);

INSERT INTO foo(part_key, start_ts, end_ts)
VALUES ('B', 9, 9999); -- 9999 = symbolic value for no end

I check the B partition first:

> SELECT * FROM foo WHERE part_key='B';

 part_key | start_ts | end_ts
----------+----------+--------
        B |        1 |      2
        B |        4 |      7
        B |        9 |     13
        B |        9 |   9999

(4 rows)

We cannot execute the "natural" query SELECT * FROM foo WHERE part_key='B' AND start_ts <= 7 AND end_ts >= 7, as it raises the following error.

InvalidRequest: Error from server: code=2200 [Invalid query] message="Clustering column "end_ts" cannot be restricted (preceding column "start_ts" is restricted by a non-EQ relation)"

In the DataStax blog there is a query like the one below.

SELECT * FROM numberOfRequests
    WHERE cluster = 'cluster1'
    AND date = '2015-06-05'
    AND datacenter = 'US_WEST_COAST'
    AND (hour, minute) >= (12, 30) AND (hour) < (14);

So I try this workaround with the following slice query, using start_ts >= -9999 as an always-true condition that allows specifying the inequality on end_ts at the same time.

SELECT * FROM foo WHERE part_key='B'
AND (start_ts, end_ts) >= (-9999, 7) AND start_ts <= 7; -- -9999 = min_value

It executes without any warning and gives the results below.

 part_key | start_ts | end_ts
----------+----------+--------
        B |        1 |      2
        B |        4 |      7

(2 rows)

At this point, the first row (with start_ts = 1) should not be there in my view, as end_ts >= 7 is not true for it. Since start_ts >= -9999 is always true, it seems the first condition (start_ts, end_ts) >= (-9999, 7) is simply ignored: the results are the same as for the query SELECT * FROM foo WHERE part_key='B' AND start_ts <= 7.
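The observed result can be reproduced outside Cassandra: Python tuples compare lexicographically, which mirrors how Cassandra compares clustering-column tuples (assuming the default ASC clustering order used here). A minimal sketch:

```python
# Rows of partition 'B' as (start_ts, end_ts) tuples, in clustering order.
rows = [(1, 2), (4, 7), (9, 13), (9, 9999)]

# The workaround query: (start_ts, end_ts) >= (-9999, 7) AND start_ts <= 7.
# Tuples compare lexicographically, so (1, 2) >= (-9999, 7) is True
# because 1 > -9999; end_ts is never even looked at.
matches = [r for r in rows if r >= (-9999, 7) and r[0] <= 7]
print(matches)  # [(1, 2), (4, 7)] -- same rows cqlsh returns
```

This shows the tuple bound is not ignored; it just never eliminates any row, because every start_ts in the partition is greater than -9999.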

Reading the blog example, I thought that (at least with Cassandra 2.2) (start_ts, end_ts) >= (-9999, 7) AND start_ts <= 7 would mean start_ts >= -9999 AND end_ts >= 7 AND start_ts <= 7, similarly to (start_ts, end_ts) = (4, 7), which means start_ts = 4 AND end_ts = 7, as below.

SELECT * FROM foo WHERE part_key='B' AND (start_ts, end_ts) = (4, 7);

 part_key | start_ts | end_ts
----------+----------+--------
        B |        4 |      7

(1 rows)

How is such a tuple inequality really interpreted? Is there any way to rewrite it so it "works"?

That doesn't really answer the question, or at least it indicates that Cassandra's results are not consistent. - cchantep

4 Answers

1 vote

What you're hitting here is comparison by tuple: because you've combined your two columns into a tuple, you get a slightly different comparison behavior than you expect. It's correct; it's just not what you expect.

(start_ts, end_ts) >= (-9999, 7)

That does not mean start_ts >= -9999 AND end_ts >= 7; it means the left (inclusive) bound is the tuple (-9999, 7). There may be a tuple (-9998, 1) which is tuple-wise greater than (-9999, 7), even though end_ts == 1 is less than 7.
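This can be checked directly in Python, whose built-in tuple comparison is also lexicographic:

```python
# Lexicographic comparison: the first element decides unless it ties.
assert (-9998, 1) > (-9999, 7)   # -9998 > -9999, so the second element is ignored
assert (1, 2) >= (-9999, 7)      # same reason: 1 > -9999
assert (-9999, 6) < (-9999, 7)   # tie on the first element, so compare the second
print("tuple comparison is lexicographic")
```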

1 vote

Let's insert some more records:

INSERT INTO foo (part_key, start_ts, end_ts) VALUES ('B', 1, 7);
INSERT INTO foo (part_key, start_ts, end_ts) VALUES ('B', 4, 8);
INSERT INTO foo (part_key, start_ts, end_ts) VALUES ('B', 9, 7);

Now we have this data for part_key = 'B':

cqlsh:test> SELECT * FROM foo WHERE part_key='B' ;

 part_key | start_ts | end_ts
----------+----------+--------
        B |        1 |      2
        B |        1 |      7
        B |        4 |      7
        B |        4 |      8
        B |        9 |      7
        B |        9 |     13
        B |        9 |   9999

Now let's query on this data:

cqlsh:test> SELECT * FROM foo WHERE part_key='B' AND (start_ts, end_ts) >= (1, 4) AND (start_ts, end_ts) <= (9, 7);

 part_key | start_ts | end_ts
----------+----------+--------
        B |        1 |      7
        B |        4 |      7
        B |        4 |      8
        B |        9 |      7

It seems the result is incorrect. But it's not; our understanding is.

Cassandra stores your data sorted by the clustering columns (start_ts, end_ts): first by start_ts, then, within each start_ts, by end_ts. When we query with (start_ts, end_ts) >= (1, 4) AND (start_ts, end_ts) <= (9, 7), Cassandra treats (start_ts, end_ts) as a single composite value and returns the contiguous slice between (1, 4) and (9, 7):

 part_key | start_ts | end_ts
----------+----------+--------
        B |        1 |      2
-------------------------------> start range
        B |        1 |      7
        B |        4 |      7
        B |        4 |      8
        B |        9 |      7
-------------------------------> end range       
        B |        9 |     13
        B |        9 |   9999
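The slice above can be reproduced with lexicographic tuple comparison in Python, which behaves the same way as Cassandra's composite clustering comparison for this ASC-ordered table:

```python
# All rows of partition 'B', in clustering order.
rows = [(1, 2), (1, 7), (4, 7), (4, 8), (9, 7), (9, 13), (9, 9999)]

# The composite slice (start_ts, end_ts) BETWEEN (1, 4) AND (9, 7):
# one contiguous range of the sorted rows, not per-column filters.
slice_ = [r for r in rows if (1, 4) <= r <= (9, 7)]
print(slice_)  # [(1, 7), (4, 7), (4, 8), (9, 7)]
```

Note how (1, 2) falls before the start of the range (2 < 4 on the tied first element) and (9, 13) falls after its end, exactly as the markers in the table show.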

Hope you understand.

-1 votes

One possible approach is to use filtering on your second column. Unfortunately, filtering on clustering columns is only supported since Cassandra 3.6. If your version is older than that but is a 3.0.x release, you can solve your problem by converting your second clustering column into a regular one, in which case the following query gives you the results you expect:

SELECT * FROM foo WHERE part_key='B' AND start_ts <= 7 AND end_ts >= 7 ALLOW FILTERING;

Now, I do not know your data and its cardinality, so filtering might not be a good option.

Another option would be to change your data model. There are different ways to model the data so that a query fulfilling your need can run efficiently.

-2 votes

It's all about the ORDER of the tuples, not the VALUEs of the individual columns. In simple words, it is not about comparing the values but about comparing the positions of the tuples. Once Cassandra orders the tuples according to your clustering order, it just compares those positions. For example, in the original partition B: (start_ts, end_ts) >= (1, 1) AND (start_ts, end_ts) <= (7, 7)

The first part of this logical AND, (start_ts, end_ts) >= (1, 1), includes every tuple at or after (1, 1) in clustering order: (1,2), (4,7), (9,13), (9,9999)

The second part, (start_ts, end_ts) <= (7, 7), includes every tuple at or before (7, 7) in clustering order: (1,2), (4,7)

Due to this design, you will always need further checks in application code. For example, if your table has start and end date columns, you have clustered it by StartDate ASC, EndDate ASC, and you are looking for all rows within a FromDate-ToDate range, you will always have to filter the rows further in code. In the WHERE clause you would have: (StartDate, EndDate) >= (FromDate, FromDate) AND (StartDate, EndDate) <= (ToDate, ToDate).

This will NOT be equivalent to the SQL condition: StartDate>=FromDate AND EndDate<=ToDate
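A quick Python check (tuples compare lexicographically, mirroring Cassandra's clustering comparison) makes the non-equivalence concrete. The row values here are hypothetical, chosen only to show the two conditions disagreeing:

```python
# Hypothetical row: StartDate=3, EndDate=9999; query range FromDate=1, ToDate=7.
start, end = 3, 9999
from_d, to_d = 1, 7

# The Cassandra tuple-range condition: one lexicographic range.
tuple_cond = (from_d, from_d) <= (start, end) <= (to_d, to_d)

# The per-column SQL condition the range is often mistaken for.
sql_cond = start >= from_d and end <= to_d

print(tuple_cond, sql_cond)  # True False -- the tuple range keeps the row,
                             # the per-column condition rejects it (9999 > 7)
```

Rows like this one are why the post-query filtering step in application code cannot be skipped.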

Hope this helps.