1
votes

What is the Execution path for the following Query in Cassandra: - 5 rows from one Cassandra node with token 1(Node1) - 5 rows from one Cassandra node with token 2(Node2) - 5 rows from one Cassandra node with token 3(Node3)

A client sends a query to Node1. - What is the sequence this query is executed in 3 nodes? - How Node1 propagates this query to node2 and node3? - Node1 merges the rows from node2 and node3 to serve the complete query results?

1
What exact query are you issuing? - Richard
A query to fetch many rows from all the 3 nodes. I would like to understand how this query processing works in Cassandra nodes as given in the example in the question. - Vinodh
Are you doing queries like SELECT * FROM mytable; or SELECT * FROM mytable WHERE key in ('key1', 'key2');? It makes a difference to how it is processed. If you're using the thrift interface, are you using get_range_slices or multiget_slice? - Richard
I have not issued any queries yet. Gaining an understanding. Some more time before I start creating Column Families. I would like to know about the select * case first. If you could shed some light on "where clause on row keys" that will be useful too. - Vinodh

1 Answers

4
votes

There are two types of queries you can issue to retrieve data from multiple partitions (I'll use CQL terminology - a partition is what used to be called a row). Which one you use depends on if you know the partition keys or not.

I'll assume a simple schema that doesn't have any clustering keys:

CREATE TABLE mytable (key text PRIMARY KEY, field text);

If you don't know the partition keys, you can issue

SELECT * FROM mytable LIMIT 15;

This will return the first 15 rows, ordered by hash of the partition. Because it is ordered by the hash, such queries are only normally useful if you want to page through all your data.

The node that receives the query (the coordinator for this query) first forwards it to the node with the lowest token plus replicas. They return up to 15 rows. If fewer then the coordinator will forward on to the node with the second lowest token plus replicas. This happens until 15 rows are found, or until all nodes have been contacted. This query therefore could potentially contact every node in the cluster.

With replication factor greater than 1, conflicting results could be returned. The coordinator looks at the timestamps to merge the results and returns only the latest to the client.

If you do know the partition keys, you can use

SELECT * FROM mytable WHERE key in ('key1', 'key2');

The coordinator treats this the same as receiving the separate queries:

SELECT * FROM mytable WHERE key = 'key1';
SELECT * FROM mytable WHERE key = 'key2';

It forwards the messages to the node(s) who hold data for each key. There is one per key so these queries are executed in parallel. The responses are gathered on the coordinator, merged so only the latest remain, and sent to the client.