IN query on secondary index in cassandra when partition key is specified

Question

I'm working with a system that uses a secondary index in cassandra, along with a composite primary key, e.g.

CREATE TABLE table (
  a bigint,
  b bigint,
  c bigint,
  PRIMARY KEY (a, b, c)
) WITH CLUSTERING ORDER BY (b ASC, c ASC)
CREATE INDEX secondary_index ON table (c);

One of the operations in the application using the table is to fetch a number of rows (typically tens) specifying the partition key and the secondary index key. Currently, it performs one query for each (partition key, secondary key) pair, in parallel, which works fine, e.g.:

select * from table where a = ? and c = ?;

However, I've noticed that the system's workload is such that most of the time, there is significant overlap in the partition keys across the requested rows, sometimes more than half of them have the same partition key. So, I thought that it might be more efficient to perform one query per partition key, with an IN clause on the secondary key, reducing the number of overall queries to single digits in most cases, and reducing read query overhead on the cluster.

However, at least executed from cqlsh, this does not seem to be allowed:

select * from table where a = ? and c in (...);
InvalidRequest: Error from server: code=2200 [Invalid query] message="PRIMARY KEY column "c" cannot be restricted as preceding column "b" is not restricted"

Is this just not allowed, and I'll have to continue making individual queries? Is there some reason it wouldn't actually be more efficient? Or is this just a limitation of CQL, and IN queries cannot use the secondary index? Perhaps there is an issue because the secondary index key is also in the primary key, and Cassandra attempts to use that instead of the secondary index?

there are different restrictions on using secondary indexes, and this may depend on the version of Cassandra - what are you using? Also, how many rows do you have per partition? sometimes it could be ok to execute with allow filtering — Alex Ott
cassandra version is 3.0.17, and the attempts I've made have been with cqlsh 5.0.1. Rows per partition vary, but are often in the hundreds or thousands, and up to hundreds of thousands. It doesn't seem like the query should need filtering given the secondary index, and given the partition size relative to number of rows requested, I don't think it would be good to allow it. — jstultz

hengsok hengsok · Accepted Answer · 2021-07-21T05:09:43

You are not allowed to execute

select * from table where a = ? and c = ?;

Because then it means that Cassandra has to scan over the whole partition 'a' just to find all the values where c = 'your defined value'. This is because Cassandra do not have any info on what value b is and cannot pinpoint directly to the row.

There are good explanations on this page for most of query patterns. https://www.datastax.com/blog/deep-look-cql-where-clause

IN query on secondary index in cassandra when partition key is specified

1 Answers