There are two types of queries you can issue to retrieve data from multiple partitions (I'll use CQL terminology: a partition is what used to be called a row). Which one you use depends on whether you know the partition keys.
I'll assume a simple schema that doesn't have any clustering keys:
CREATE TABLE mytable (key text PRIMARY KEY, field text);
If you don't know the partition keys, you can issue
SELECT * FROM mytable LIMIT 15;
This will return the first 15 rows, ordered by the hash (token) of the partition key. Because the order follows the hash rather than the key itself, such queries are normally only useful if you want to page through all your data.
The node that receives the query (the coordinator for this query) first forwards it to the node with the lowest token, plus its replicas. They return up to 15 rows. If they return fewer than 15, the coordinator forwards the query to the node with the second lowest token, plus its replicas. This continues until 15 rows are found or until all nodes have been contacted, so the query could potentially contact every node in the cluster.
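To make the walk around the ring concrete, here is a rough Python sketch. It is only a toy model, not Cassandra's actual code: the `ring` structure, the `range_scan` function and the sample tokens are all made up for illustration, each node is assumed to hold its rows already in hash order, and replicas are ignored.

```python
# Toy model of the coordinator walking the ring in token order.
# Everything here (ring layout, names, tokens) is hypothetical.

def range_scan(ring, limit):
    """ring: list of (token, rows) pairs, where rows is a list of (key, value).

    Returns the rows found and the number of nodes contacted.
    """
    results = []
    contacted = 0
    # Visit nodes from the lowest token upwards.
    for token, rows in sorted(ring, key=lambda node: node[0]):
        contacted += 1
        # Take only as many rows as are still needed to reach the limit.
        results.extend(rows[: limit - len(results)])
        if len(results) >= limit:
            break
    return results, contacted


ring = [
    (200, [("k3", "c"), ("k4", "d")]),
    (100, [("k1", "a"), ("k2", "b")]),
    (300, [("k5", "e")]),
]

few_rows, few_contacted = range_scan(ring, limit=3)
all_rows, all_contacted = range_scan(ring, limit=15)
```

With a limit of 3 the scan stops at the second node. With a limit of 15 there are not enough rows, so every node in the toy ring is contacted.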
With replication factor greater than 1, conflicting results could be returned. The coordinator looks at the timestamps to merge the results and returns only the latest to the client.
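The merge step can be sketched like this. Again, this is a simplified illustration and not Cassandra's implementation: `merge_replicas` and the sample data are hypothetical, and the sketch keeps a single timestamp per row, whereas Cassandra tracks write timestamps per cell.

```python
# Simplified "latest timestamp wins" merge of replica responses.
# Names and data are made up; one timestamp per row is an assumption.

def merge_replicas(responses):
    """responses: list of dicts mapping key -> (value, write_timestamp)."""
    merged = {}
    for response in responses:
        for key, (value, timestamp) in response.items():
            # Keep this version if the key is new or this write is newer.
            if key not in merged or timestamp > merged[key][1]:
                merged[key] = (value, timestamp)
    return merged


replica_a = {"key1": ("old", 1), "key2": ("x", 5)}
replica_b = {"key1": ("new", 2)}

merged = merge_replicas([replica_a, replica_b])
```

Here `key1` resolves to the value with the higher timestamp, and `key2`, which only one replica returned, is passed through unchanged.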
If you do know the partition keys, you can use
SELECT * FROM mytable WHERE key in ('key1', 'key2');
The coordinator treats this the same as receiving the separate queries:
SELECT * FROM mytable WHERE key = 'key1';
SELECT * FROM mytable WHERE key = 'key2';
It forwards the messages to the node(s) that hold data for each key. There is one message per key, so these queries are executed in parallel. The responses are gathered on the coordinator, merged so that only the latest values remain, and sent to the client.
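As a rough picture of that fan-out, the sketch below issues one lookup per key in parallel and gathers the answers. It is an analogy only: `fetch_one`, `multiget` and the in-memory `store` are hypothetical stand-ins, and a thread pool is just a convenient way to show the parallelism, not how the coordinator is actually built.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the per-key query
#   SELECT * FROM mytable WHERE key = '...';
def fetch_one(store, key):
    return key, store.get(key)


def multiget(store, keys):
    """Run one lookup per key in parallel and gather the results."""
    # max(1, ...) avoids an invalid pool size when keys is empty.
    with ThreadPoolExecutor(max_workers=max(1, len(keys))) as pool:
        pairs = list(pool.map(lambda key: fetch_one(store, key), keys))
    # Keys with no data are simply absent from the result.
    return {key: value for key, value in pairs if value is not None}


store = {"key1": "a", "key2": "b", "key3": "c"}
found = multiget(store, ["key1", "key2"])
partial = multiget(store, ["key1", "missing"])
```

Because each key is looked up independently, the cost grows with the number of keys in the IN clause rather than with the size of the table.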
SELECT * FROM mytable; or SELECT * FROM mytable WHERE key in ('key1', 'key2');? It makes a difference to how it is processed. If you're using the thrift interface, are you using get_range_slices or multiget_slice? - Richard