Unfortunately, the column I am using to fetch records is not part of the partition key. Could it be slow due to that?
Yes, that is likely why things are slow. Although to be fair, Spark is designed to query distributed data stores. It's not designed to be fast.
So I'm assuming that your PRIMARY KEY definition looks like this:
PRIMARY KEY((A,B),C)
The reason that querying by C is slow is that Cassandra (Spark) cannot determine which node in the cluster is responsible for the data based on C alone. Therefore, every node needs to be checked for values of C which satisfy your query.
Would querying by all 3 be faster?
Yes, querying by all three would likely be faster. This is because the partition key is made up of A and B. With a partition key based query, in this case, the key values of A and B are hashed together. That hash is matched against the token ranges that each node is responsible for. In this way, a target node containing the desired data can easily be determined, and there is no need to check each node for matching values.
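To make the token lookup concrete, here is a minimal Python sketch of the idea, not Cassandra's actual implementation. It assumes a hypothetical four-node ring with evenly split token ranges, and it uses MD5 as a stand-in hash (Cassandra's default partitioner is Murmur3). The names token, owner, NODE_TOKENS and NODE_NAMES are made up for illustration.

```python
import bisect
import hashlib

# Assumption: a toy ring of 2**32 tokens split evenly across four nodes.
RING_SIZE = 2**32
NODE_NAMES = ["node1", "node2", "node3", "node4"]
# Upper (inclusive) token bound owned by each node, in ascending order.
NODE_TOKENS = [RING_SIZE // 4 * (i + 1) - 1 for i in range(4)]


def token(a, b):
    """Hash the full partition key (A, B) to a token on the ring.

    MD5 is only a stand-in here; it is not what Cassandra uses by default.
    """
    digest = hashlib.md5(f"{a}:{b}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def owner(a, b):
    """Return the node whose token range contains the key's token."""
    idx = bisect.bisect_left(NODE_TOKENS, token(a, b))
    return NODE_NAMES[idx]


# With both A and B known, exactly one node needs to be asked.
print(owner("a1", "b1"))
```

The point of the sketch is that owner needs both A and B: the hash is computed over the whole partition key, so a single lookup only works when every partition key column is supplied.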
If I were to query by using just 1 column from primary key (Let's say A), that would also be fast right?
No, it would not. Given the partition key definition of (A,B), the node containing the data cannot be determined by A alone. In fact, as the result sets would likely be larger, querying by A would probably be slower than querying by C.
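As a rough illustration of why A alone is not enough, the Python sketch below hashes many keys that share the same A but differ in B and collects the nodes they land on. It is a simplification under stated assumptions: the four-node cluster is hypothetical, MD5 stands in for the real partitioner, and a modulo replaces real token ranges. The names NODES, owner and owners_for_a1 are made up for the example.

```python
import hashlib

NODES = ["node1", "node2", "node3", "node4"]  # hypothetical cluster


def owner(a, b):
    """Map the full partition key (A, B) to a node.

    Stand-in for the partitioner: hash the whole key, then pick a node.
    """
    digest = hashlib.md5(f"{a}:{b}".encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:4], "big") % len(NODES)]


# Rows that share A = "a1" but have different values of B.
owners_for_a1 = {owner("a1", f"b{i}") for i in range(1000)}

# The rows end up spread over several nodes, so knowing A alone
# does not narrow the query down to a single node.
print(sorted(owners_for_a1))
```

Because B changes the hash, rows with the same A are scattered across the cluster, which is why a query restricted only by A still has to consult every node.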