How Spark-cassandra-connector determines the range to query on Cassandra?

Question

I have a three node Cassandra cluster with Spark executor running on each node. I understand that to scan the Cassandra database, SCC(Spark-Cassandra-Connector) uses range query putting tokens in where clause. How a SCC instance running on different node is able to select ranges different from other SCC instances running on other nodes. For example a SCC instance A on node1 picks a range RangeA, then how SCC instances B and C decides not to use the same range RangeA?

Do they communicate with each other?

Nikola Yovchev Nikola Yovchev · Accepted Answer · 2021-07-18T20:54:22

Spark-cassandra-connector basics

The spark-cassandra-connector has fairly complicated internals, but the most important things (overly simplified) are the following:

the connector would naturally prefer to query locally. E.g to avoid network and to have the spark executor query its local cassandra node
to do that, the driver needs to understand the Cassandra topology and where the token ranges you need to query are (there is an initial ring describe done by the driver, so after that there is a full understanding where to find what part of your token)
after understanding where the token ranges are, and mapping each token to an IP, the connector spreads the work in such a way that each local spark executor queries that part of the range that is local to it

More detailed information

It's a bit more complex than that, but that's it in a nutshell. I think this video from Datastax explains it a bit better.

You might also want to consider reading this question (with, admittedly, a vague answer).

How you structure your data is important for this to work out of the box

Note that there is a bit of skill/knowledge required to structure your data and your query in such a way that the driver can try to do that.

Actually, the most common type of performance problems usually stem from badly structured data or queries leading to non-local execution. The datastax java driver, and the spark-cassandra-connector internally try their best effort to make the queries local, but you need to also follow the best practices in structuring your data. If you haven't already done so, I recommend reading/going through the trainings described in the Data Modeling By Example articles by DataStax.

Edit: queries without locality

As you mentioned, sometimes the executors don't reside on the same host as the nodes. Still, the principle is the same:

When you have a query, it is over a certain token range. Some of the data for this query will be "owned" by node A, some of the data will be "owned" by node B, and some by node C.

The ring describe operation tells the driver, for a certain range, which part of it is in node A, which in node B, and which in node C. The driver then essentially splits the query in 3 subqueries and asks for it from the appropriate nodes which own the particular range.

Each node responds with their own portion, and at the end the driver aggregates it.

You might notice that local or not, the principle is exactly the same:

ask each node only about the particular range it owns, which the driver learned earlier by using the ring describe operation.

Hope that makes it a bit clearer.