
My Cassandra schema contains a table whose partition key is a timestamp and whose clustering key is a parameter column.

Each partition contains 10k+ rows; this is logging data arriving at a rate of one partition per second.

On the other hand, users can define "datasets", and I have another table whose partition key is the dataset name and whose clustering column is a timestamp referring to the first table (so a "dataset" is effectively a list of partition keys).

Of course, what I would like to do looks like an anti-pattern for Cassandra, since I'd like to join two tables.

However, using Spark SQL I can run such a query and perform the join:

SELECT * FROM datasets
    JOIN data ON data.timestamp = datasets.timestamp
    WHERE datasets.name = 'my_dataset'
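
For completeness, this is roughly how I expose the two tables to Spark SQL before running the query above (a minimal sketch: the "logs" keyspace and the connection host are placeholders for my real setup):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("dataset-join")
      .config("spark.cassandra.connection.host", "127.0.0.1")
      .getOrCreate()

    // Register both Cassandra tables as temp views so the SQL query can see them
    for (table <- Seq("datasets", "data")) {
      spark.read
        .format("org.apache.spark.sql.cassandra")
        .options(Map("keyspace" -> "logs", "table" -> table))
        .load()
        .createOrReplaceTempView(table)
    }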

Now the question is: is Spark SQL smart enough to read only the partitions of data that correspond to the timestamps defined in datasets?


1 Answer


Edit: fixed the answer with regard to join optimization.

is Spark SQL smart enough to read only the partitions of data which correspond to the timestamps defined in datasets?

No. In fact, since you provide the partition key for the datasets table, the Spark-Cassandra connector will perform predicate pushdown and execute the partition restriction directly in Cassandra with CQL. But there will be no predicate pushdown for the join operation itself unless you use the RDD API with joinWithCassandraTable().
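
For illustration, here is a minimal sketch of the joinWithCassandraTable() approach (the "logs" keyspace and the column names are assumptions based on the schema described in the question):

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("dataset-join")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // Read only the timestamps of the chosen dataset (partition-key restricted in CQL),
    // then join each timestamp against its partition in the data table.
    val keys = sc.cassandraTable("logs", "datasets")
      .select("timestamp")
      .where("name = ?", "my_dataset")
      .map(row => Tuple1(row.getDate("timestamp")))

    // joinWithCassandraTable issues one targeted CQL query per partition key,
    // so only the partitions listed in the dataset are read from `data`.
    val joined = keys.joinWithCassandraTable("logs", "data")

This gives you exactly the "read only the relevant partitions" behaviour you are asking about, at the cost of dropping down from SQL to the RDD API.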

See here for all possible predicate push down situations: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra/BasicCassandraPredicatePushDown.scala
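
If you want to verify what a given DataFrame query actually pushes down, you can inspect the physical plan with explain(); a sketch, again assuming the "logs" keyspace and a SparkSession built as in the question:

    // The partition-key restriction should show up as a pushed filter
    // in the Cassandra scan node of the printed plan; the join condition will not.
    spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "logs", "table" -> "datasets"))
      .load()
      .filter("name = 'my_dataset'")
      .explain(true)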