2
votes

I have a Set up with: 5 Cassandra node cluster with RF =3, I performed a secondary index for a column in the table 'user',

1) As per my study on Secondary Index using the link: https://www.datastax.com/dev/blog/cassandra-native-secondary-index-deep-dive I understood that secondary indexes will be stored in the local node. Does it mean that in the five node cluster only in one node the secondary index will be available? If not in the RF =3 for user table, In how many nodes the Secondary Index table will be available?

2) How does the following two query differ in execution?

   CREATE TABLE user(
    user_group int PRIMARY KEY,
    user_name text,
    user_phone varint
   );

  CREATE INDEX username_idx ON user (user_name);

In this table setup,

Query 1 : SELECT * FROM user WHERE user_name = 'test';

Query 2 : SELECT * FROM user WHERE user_group = 1 AND user_name = 'test';

How many nodes (In the 5 node cluster) will the above two queries pass through for execution and How the two queries differ in performance?

Edited :

Say I have a table like below,

CREATE TABLE nodestat (
    uniqueId text,
    totalCapacity int,
    physicalUsage int,
    flashMode text,
    timestamp timestamp,
    primary key (uniqueId, timestamp)) 
    with clustering order by (timestamp desc);

CREATE CUSTOM INDEX nodeIp_idx ON nodestat(flashMode)

Query 3 : select * from nodestat where uniqueId = 'test' AND flashMode = 'yes'

So In this case, I always have only one partition in the table, so How does the secondary index search differ compare to the secondary index without partition key? How efficient is it?

1
Refers docs.datastax.com/en/cql/3.3/cql/cql_using/… to see when index and when no indexJacky1205

1 Answers

4
votes

Regd your Question 1:

Does it mean that in the five node cluster only in one node the secondary index will be available?

The secondary index is available in every node of the cluster, built upon the data in that node and its just local to that node. That is, its aware of only the primary keys in that particular node. You can imagine the secondary index to be a lookup table with references to primary keys on that node.

So every node builds its own secondary index (in your case all 5), but unaware of each others references.

If not in the RF =3 for user table, In how many nodes the Secondary Index table will be available?

There is no replication factor for secondary indexes, since its local to every node. Since your data is already being replicated RF = 3, your secondary indexes in every node will have that indexed.

Regd your Question 2:

Query 1 : SELECT * FROM user WHERE user_name = 'test';

This query is going to perform a scatter gather on all nodes in the cluster. Since the secondary indexes are local to each node, every node (in your case all 5) has to execute the query -> perform a secondary index lookup to figure out the partition key -> then fetch the actual results back to coordinator.

As the table grows bigger, the query often results in timeout. In extreme cases it can bring down the node (just like "select *" without partition key). Hence secondary indexes and this type of query (without partition key) in general are discouraged in Cassandra and better to avoid them

Query 2 : SELECT * FROM user WHERE user_group = 1 AND user_name = 'test';

This query will perform better compared to the previous one, as it has filter on partition key. In the table definition above there is no clustering column, so this query would just filter on primary key as there is only one row per partition. Hence there isn't much improvement with secondary index. Overall its not a scatter gather type of query and hence perform much better.

edited to explain query3

Query 3 : select * from nodestat where uniqueId = 'test' AND flashMode = 'yes'

In this query the secondary index is used in conjunction with partition key. This secondary index would help in case of 1000s of clustering columns exists for a given partition key and we want to quickly narrow down on the resultset. Remember the secondary index stores the entire primary key (partition key + clustering column reference). So in case of a wide partition, this secondary index proves useful when used alongside a partition key.

For example in your case, say there is only one partition uniqueId = 'test'. But within that partition 'test', say there are 10000 different timestamp values (clustering column). So potentially there could be 10000 different values for "flashMode". This secondary index will help narrow down to the "flashMode" column with value "yes" within the partition 'test' amongst that 10000 matches.