1 vote

I have the following CQL table (a bit simplified for clarity):

CREATE TABLE test_table (
    user        uuid,
    app_id      ascii,
    domain_id   ascii,
    props       map<ascii,blob>,
    PRIMARY KEY ((user), app_id, domain_id)
)

The idea is that this table would contain many users (i.e. rows — say, dozens of millions). For each user there would be a few domains of interest, with a few apps per domain. And for each user/domain/app there would be a small set of properties.

I need to scan this entire table and load its contents in chunks for given app_id and domain_id. My idea was to use the TOKEN function to be able to read the whole data set in several iterations. So, something like this:

SELECT props FROM test_table WHERE app_id='myapp1'
  AND domain_id='mydomain1'
  AND TOKEN(user) > -9223372036854775808
  AND TOKEN(user) < 9223372036854775807;

I was assuming that this query would be efficient because I specify the range of the row keys, and by specifying the values of the clustering keys I effectively specify the column range. But when I try to run this query I get the error message "Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING".

I have limited experience with Cassandra, and I assumed that this sort of query would map to a get_range_slices() call, which accepts a slice predicate (i.e. the range of columns defined by my app_id/domain_id values) and the key range defined by my token range. It seems that either I misunderstand how this sort of query is handled, or I misunderstand the efficiency of the get_range_slices() call.

To be more specific, my questions are:

  • does this data model make sense for the kind of query I have in mind?
  • is this query expected to be efficient?
  • if it is efficient, then why am I getting this error message asking me to ALLOW FILTERING?

My only guess about the last one was that the rows that do not have the given combination of app_id/domain_id would need to be skipped when building the result.

--- update ----

Thanks for all the comments. I have been doing more research on this and there is still something that I do not fully understand.

In the given structure, what I am trying to get is like a rectangular area of my data set (assuming that all rows have the same columns): the top and bottom of the rectangle are determined by the token range, and the left/right sides are defined by the column range (slice). So this should naturally translate into a get_range_slices request.

My understanding (correct me if I am wrong) is that the reason CQL requires me to put the ALLOW FILTERING clause is that there will be rows that do not contain the columns I am looking for, so they will have to be skipped. And since nobody knows whether it will have to skip every second row or the first million rows before finding one that fits my criteria (in the given range), this is what causes the unpredictable latency and possibly even a timeout. Am I right?

I have tried to write a test that performs the same kind of query but using the low-level Astyanax API (over the same table; I had to read the data generated with CQL, which turned out to be quite simple), and this test does work — except that it returns keys with no columns where the row does not contain the slice of columns I am asking for. Of course, I had to implement some kind of simple paging based on the starting token and a limit to fetch the data in small chunks.
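For reference, the chunked reading I implemented with Astyanax, expressed in CQL terms, looks roughly like this (a sketch — the LIMIT value is illustrative, and the "last seen token" bookkeeping happens in client code):

// start with last_token = -9223372036854775808 (the minimum Murmur3 token),
// then repeat, substituting the TOKEN(user) of the last row returned,
// until a query comes back empty
SELECT TOKEN(user), user, props FROM test_table
  WHERE TOKEN(user) > -9223372036854775808   // replace with last seen token
  LIMIT 1000;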

Now I am wondering - again, considering that I would need to deal with dozens of millions of users: would it be better to partially "rotate" this table and organize it in something like this:

Row key: domain_id + app_id + partition no (something like hash(user) mod X)
Clustering key: column partition no (something like hash(user) >> 16 mod Y) + user

As for the "column partition no", I am not sure whether it is really needed. I assume that if I go with this model I will have a relatively small number of rows (X = 1000..10000) for each domain + app combination. This will allow me to query the individual partitions, even in parallel if I want to. But (assuming the user is a random UUID) for 100M users it will result in dozens or hundreds of thousands of columns per row. Is it a good idea to read one such row in one request? It would create some memory pressure on Cassandra, I am sure. So maybe reading them in groups (say, Y = 10..100) would be better?
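Sketched as CQL, the "rotated" layout I am describing would be something like this (table and bucket column names are mine; the X/Y bucket counts are the illustrative values above):

CREATE TABLE test_table_rotated (
    domain_id  ascii,
    app_id     ascii,
    row_part   int,                 // hash(user) mod X
    col_part   int,                 // (hash(user) >> 16) mod Y
    user       uuid,
    props      map<ascii,blob>,
    PRIMARY KEY ((domain_id, app_id, row_part), col_part, user)
);

// read one partition (or one column group within it) with plain equality:
SELECT user, props FROM test_table_rotated
  WHERE domain_id='mydomain1' AND app_id='myapp1' AND row_part=17;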

I realize that what I am trying to do is not what Cassandra does well: reading "all" or a large subset of CF data in chunks that can be pre-calculated (like token ranges or partition keys) for parallel fetching from different hosts. But I am trying to find the pattern that is most efficient for such a use case.
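For example, pre-calculating four equal token chunks for four parallel workers would give queries like these (bounds are just the full Murmur3 token range split in four; assumes the Murmur3Partitioner):

// worker 1
SELECT user, props FROM test_table
  WHERE TOKEN(user) >= -9223372036854775808 AND TOKEN(user) < -4611686018427387904;
// worker 2
SELECT user, props FROM test_table
  WHERE TOKEN(user) >= -4611686018427387904 AND TOKEN(user) < 0;
// worker 3
SELECT user, props FROM test_table
  WHERE TOKEN(user) >= 0 AND TOKEN(user) < 4611686018427387904;
// worker 4
SELECT user, props FROM test_table
  WHERE TOKEN(user) >= 4611686018427387904 AND TOKEN(user) <= 9223372036854775807;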

By the way, the query like "select * from ... where TOKEN(user) > X and TOKEN(user) < Y" …


2 Answers

5 votes

Short answer

This warning means that Cassandra would have to read non-indexed data and filter out the rows that don't satisfy the criteria. If you add ALLOW FILTERING to the end of the query, it will work, but it will scan a lot of data:

SELECT props FROM test_table 
WHERE app_id='myapp1' 
AND domain_id='mydomain1' 
AND TOKEN(user) > -9223372036854775808 
AND TOKEN(user) < 9223372036854775807
ALLOW FILTERING;

Longer explanation

In your example the primary key consists of two parts: user is used as the partition key, and <app_id, domain_id> forms the remaining (clustering) part. Rows for different users are distributed across the cluster, with each node responsible for a specific range of the token ring.

Partitions on a single node are sorted by the hash of the partition key (token(user) in your example). The different rows for a single user are stored on a single node, sorted by the <app_id, domain_id> tuple.

So, the primary key forms a tree-like structure: the partition key adds one level of hierarchy, and each remaining field of the primary key adds another. By default, Cassandra processes only queries that return all rows from a contiguous range of this tree (or from several ranges if you use the key IN (...) construct). If Cassandra would have to filter out some rows, ALLOW FILTERING must be specified.

Example queries that don't require ALLOW FILTERING:

SELECT * FROM test_table 
WHERE user = 'user1'; 
//OK, returns all rows for a single partition key

SELECT * FROM test_table 
WHERE TOKEN(user) > -9223372036854775808 
AND TOKEN(user) < 9223372036854775807; 
//OK, returns all rows for a contiguous range of the token ring

SELECT * FROM test_table 
WHERE user = 'user1'
AND app_id='myapp1'; 
//OK, the rows for specific user/app combination 
//are stored together, sorted by domain_id field

SELECT * FROM test_table 
WHERE user = 'user1'
AND app_id > 'abc' AND app_id < 'xyz'; 
//OK, since rows for a single user are sorted by app

Example queries that do require ALLOW FILTERING:

SELECT props FROM test_table 
WHERE app_id='myapp1';
//Must scan all the cluster for rows, 
//but return only those with specific app_id

SELECT props FROM test_table 
WHERE user='user1'
AND domain_id='mydomain1';
//Must scan all rows having user='user1' (all app_ids), 
//but return only those having specific domain

SELECT props FROM test_table 
WHERE user='user1'
AND app_id > 'abc' AND app_id < 'xyz'
AND domain_id='mydomain1';
//Must scan the range of rows satisfying <user, app_id> condition,
//but return only those having specific domain

What to do?

In Cassandra it's not possible to create a secondary index on part of the primary key. There are a few options, each with its pros and cons:

  • Add a separate table that has primary key ((app_id), domain_id, user) and duplicate the necessary data in the two tables. It will allow you to query the necessary data for a specific app_id or <app_id, domain_id> combination. If you need to query a specific domain across all apps, a third table is necessary. This approach is called materialized views.
  • Use some sort of parallel processing (Hadoop, Spark, etc.) to perform the necessary calculations for all app/domain combinations. Since Cassandra needs to read all the data anyway, there probably won't be much difference from querying a single pair. If the results for the other pairs can be cached for later use, it will probably save some time.
  • Just use ALLOW FILTERING if the query performance is acceptable for your needs. Dozens of millions of partition keys is probably not too much for Cassandra.
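For the first option, the duplicated table could look something like this (a sketch; the table name is mine):

CREATE TABLE test_table_by_app (
    app_id     ascii,
    domain_id  ascii,
    user       uuid,
    props      map<ascii,blob>,
    PRIMARY KEY ((app_id), domain_id, user)
);

// this now selects a contiguous slice, so no filtering is needed:
SELECT user, props FROM test_table_by_app
  WHERE app_id='myapp1' AND domain_id='mydomain1';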
2 votes

Presuming you are using the Murmur3Partitioner (which is the right choice), you do not want to run range queries on the row key. This key is hashed to determine which node holds the row, so rows are not stored in sorted key order. Doing this kind of range query would therefore require a full scan.

If you want to do this query, you should store some known value as a sentinel in your row key, so that you can query for equality rather than a range. From your data it appears that either app_id or domain_id would be a good choice, since it sounds like you always know these values when performing your query.
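For instance, moving domain_id and app_id into the partition key lets you query by equality instead of a key range (a sketch; the table name is mine):

CREATE TABLE test_table_by_domain (
    domain_id  ascii,
    app_id     ascii,
    user       uuid,
    props      map<ascii,blob>,
    PRIMARY KEY ((domain_id, app_id), user)
);

SELECT user, props FROM test_table_by_domain
  WHERE domain_id='mydomain1' AND app_id='myapp1';

Note that with dozens of millions of users a single <domain_id, app_id> partition can get very wide, which is why some bucketing (as in the question's "partition no" idea) may still be needed.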