Spark-cassandra-connector basics
The spark-cassandra-connector
has fairly complicated internals, but the most important things (overly simplified) are the following:
- the connector would naturally prefer to
query locally
. E.g to avoid network and to have the spark executor
query its local cassandra
node
- to do that, the driver needs to understand the Cassandra topology and where the token ranges you need to query are (there is an initial
ring describe
done by the driver, so after that there is a full understanding where to find what part of your token)
- after understanding where the token ranges are, and mapping each token to an IP, the connector spreads the work in such a way that each local spark executor queries that part of the range that is local to it
It's a bit more complex than that, but that's it in a nutshell. I think this video from Datastax explains it a bit better.
You might also want to consider reading this question (with, admittedly, a vague answer).
How you structure your data is important for this to work out of the box
Note that there is a bit of skill/knowledge required to structure your data and your query in such a way that the driver can try to do that.
Actually, the most common type of performance problems usually stem from badly structured data or queries leading to non-local execution. The datastax java
driver, and the spark-cassandra-connector
internally try their best effort to make the queries local, but you need to also follow the best practices in structuring your data. If you haven't already done so, I recommend reading/going through the trainings described in the Data Modeling By Example articles by DataStax.
Edit: queries without locality
As you mentioned, sometimes the executors don't reside on the same host as the nodes. Still, the principle is the same:
When you have a query, it is over a certain token range. Some of the data for this query will be "owned" by node A
, some of the data will be "owned" by node B
, and some by node C
.
The ring describe
operation tells the driver, for a certain range, which part of it is in node A
, which in node B
, and which in node C
. The driver then essentially splits the query in 3 subqueries and asks for it from the appropriate nodes which own the particular range.
Each node responds with their own portion, and at the end the driver aggregates it.
You might notice that local or not, the principle is exactly the same:
ask each node only about the particular range it owns, which the driver learned earlier by using the ring describe
operation.
Hope that makes it a bit clearer.