0
votes

I have a table in cassandra where date is not part of partition key but it is part of clustering key. While reading the table in spark I am applying date filter and it is being pushed down. I want to understand how push down works because through cql we cannot query directly on clustering key. Is the data being filtered somewhere?

Implementation in Java:

transactions.filter(transactions.col("timestamp").gt(timestamp))  //column timestamp is of type timestamp

and the physical plan coming out as

== Physical Plan == *Project [customer_user_id#67 AS customerUserId#111, cast(timestamp#66 as date) AS date#112, city#70] +- *Filter (isnotnull(timestamp#66) && isnotnull(city#70)) +- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@571db8b4 [customer_user_id#67,timestamp#66,city#70] PushedFilters: [IsNotNull(timestamp), *GreaterThan(timestamp,2018-08-13 00:00:00.0), IsNotNull(city)], ReadSchema: struct<customerUserId:int,date:date,city:string>

Also for timestamp part this worked fine but if column is of type date then if was not pushing the filter even if date was part of partition key. I had to write it as transactions.filter("date >= cast('" + timestamp + "'as date)") to make it work. (column date is of type date)

1

1 Answers

1
votes

When you don't have a condition on partition key, Spark Cassandra connector uses token ranges to perform effective scanning in parallel. So if you have condition on some clustering column clasCol (greater as in your example), connector will generate following query (pseudo-code, not real code - you can find real CQL queries if you enable debug logging):

SELECT col1, col2, ... FROM ks.table WHERE
  token(pk) > :startRange AND token(pk) <= :endRange
  AND clasCol > :your-value ALLOW FILTERING;

Then Cassandra will perform effective range scans for multiple partitions on the same node. You can look for code here if you want more details.

Regarding date - it's requires looking more into the code, but maybe it's just some type conversion missing, or something like - you can check what queries were generated for both cases.