My Cassandra schema contains a table with a partition key which is a timestamp, and a parameter
column which is a clustering key.
Each partition contains 10k+ rows. This is logging data at a rate of 1 partition per second.
On the other hand, users can define "datasets" and I have another table which contains, as a partition key the "dataset name" and a clustering column which is a timestamp referring to the other table (so a "dataset" is a list of partition keys).
Of course what I would like to do looks like an anti-pattern for Cassandra as I'd like to join two tables.
However using Spark SQL I can run such a query and perform the JOIN
.
SELECT * from datasets JOIN data
WHERE data.timestamp = datasets.timestamp AND datasets.name = 'my_dataset'
Now the question is: is Spark SQL smart enough to read only the partitions of data
which correspond to the timestamp
s defined in datasets
?