
Before creating a Cassandra improvement ticket, I am curious: what technical limitation prevents querying on a column that has no secondary index, even when the entire primary key (partition key and clustering key) is specified? With the full primary key, Cassandra is already at the specific row and could simply skip returning it based on an in-place filter on the column value. The benefit is even greater when only the partition key is specified: instead of returning many rows of a wide partition and filtering at the client, the server could filter the data itself and return only the matching rows, with ALLOW FILTERING signalling that the client knows the risk.

select * from CF where partition_key = foo and clustering_key = bar and non_indexed_column = baz

When you do use a secondary index together with a partition key, the execution plan shows that it first uses the partition key to get to the row, then runs a single-partition query on the index if data exists, and then presumably filters in place to return only the rows common to both. When you use multiple secondary indexes, there is an optimization to pick the most selective one first.

I do understand that a default secondary index is maintained like any other index data structure, and that the index is really a reverse lookup column family from the indexed value to the partition key, covering only the local data on the same node.
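
To make the reverse-lookup idea concrete, here is a minimal sketch in Python. It is only a toy model of one node, not Cassandra's actual storage engine; the names `base`, `index`, `write` and `lookup` are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict

# Toy, single-node model (an assumption for illustration, not Cassandra internals).
base = {}                 # (partition_key, clustering_key) -> indexed column value
index = defaultdict(set)  # indexed column value -> set of primary keys (reverse lookup)

def write(pk, ck, value):
    """Write a row and keep the reverse lookup in sync."""
    old = base.get((pk, ck))
    if old is not None:
        index[old].discard((pk, ck))  # drop the stale index entry
    base[(pk, ck)] = value
    index[value].add((pk, ck))

def lookup(value, pk=None):
    """Find primary keys by indexed value, optionally restricted to one partition."""
    keys = index.get(value, set())
    if pk is not None:
        keys = {k for k in keys if k[0] == pk}
    return sorted(keys)

write("foo", "bar", "baz")
write("foo", "qux", "baz")
write("zzz", "bar", "baz")
write("foo", "qux", "other")  # overwrite: ("foo", "qux") leaves the "baz" entry

all_hits = lookup("baz")               # index-only lookup across local partitions
foo_hits = lookup("baz", pk="foo")     # partition key plus index
```

The second structure is what costs extra space and has to be kept consistent on every write, which is the trade-off the question is about.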

My question is about the "big technical overhead or limitation" that keeps Cassandra from doing this filtering on the server, instead of pushing it to the client, when the entire primary key is specified.

Execution Plan summary with Primary Key and Secondary Index:
Seeking to partition beginning in data file | xyz
Executing single-partition query on indexed_column_idx
Seeking to partition indexed section in data file
Merging data from memtables and 15 sstables

Execution Plan summary with just the Secondary Index:
Executing indexed scan 
Executing single-partition query on indexed_column_idx
...

Both of these make sense. Secondary indexes are a poor fit for high-cardinality columns, you cannot create many secondary indexes without them being abused, and you cannot create new reverse-lookup CFs per index without worrying about space and consistency.

1 Answer

I tried the same query on Cassandra 2.2+ instances and it works fine :). You can "filter any column" as long as you specify the partition key. The only catch is that you have to specify ALLOW FILTERING, meaning the client accepts the risk that the query may be slow and inefficient because it scans the whole wide partition.
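
As a rough mental model of why the partition key bounds the cost, here is a small Python sketch. It is a toy in-memory stand-in, not how Cassandra is implemented; `table`, `insert` and `select_in_partition` are hypothetical names, and the values mirror the placeholders in the question's query.

```python
from collections import defaultdict

# Toy model only (an assumption for illustration): partition key -> rows in that partition.
table = defaultdict(list)

def insert(partition_key, clustering_key, **columns):
    """Append a row to its partition."""
    row = {"partition_key": partition_key, "clustering_key": clustering_key, **columns}
    table[partition_key].append(row)

def select_in_partition(partition_key, **filters):
    """Go to one partition first, then filter its rows on non-indexed columns.

    The scan is bounded by the size of that partition, not the whole table.
    """
    rows = table.get(partition_key, [])
    return [r for r in rows if all(r.get(c) == v for c, v in filters.items())]

insert("foo", "bar", non_indexed_column="baz")
insert("foo", "qux", non_indexed_column="other")
insert("zzz", "bar", non_indexed_column="baz")

# Rough analogue of: partition_key = foo AND non_indexed_column = baz ALLOW FILTERING
matches = select_in_partition("foo", non_indexed_column="baz")
```

Only the rows of partition `foo` are examined here; the row in partition `zzz` is never touched, even though it also has the value `baz`. The cost still grows with the width of the partition, which is the risk ALLOW FILTERING asks the client to accept.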

See https://www.datastax.com/dev/blog/a-deep-look-to-the-cql-where-clause