Given the following Cassandra table:
CREATE TABLE data_storage.stack_overflow_test_table (
id int,
text_id text,
clustering date,
some_other text,
PRIMARY KEY (( id, text_id ), clustering)
)
the following query is valid:
select * from data_storage.stack_overflow_test_table where id=4 and text_id='2';
since it restricts every column of the partition key.
Now consider the following Spark code:
import org.apache.spark.sql.functions.col

val ds = session
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "stack_overflow_test_table", "keyspace" -> "data_storage"))
  .load()
  .where(col("id") === 4 && col("text_id") === "2")

ds.show(10)
Since the spark-cassandra-connector pushes predicates down to Cassandra, I expected the query Spark sends to Cassandra to be something like:
SELECT "id", "text_id", "clustering", "some_other" FROM "data_storage"."stack_overflow_test_table" WHERE "id" = ? AND "text_id" = ?
However, I can see in the logs:
18/04/09 15:38:09 TRACE Connection: Connection[localhost/127.0.0.1:9042-2, inFlight=1, closed=false], stream 256, writing request PREPARE SELECT "id", "text_id", "clustering", "some_other" FROM "data_storage"."stack_overflow_test_table" WHERE "id" = ? AND "text_id" = ? ALLOW FILTERING
That means the spark-cassandra-connector appends ALLOW FILTERING to the query.
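For reference, the predicates that Spark decides to push down can also be inspected from the Spark side via the physical plan. A minimal sketch (assuming the same `session` and table as above; the exact `PushedFilters` wording depends on the Spark version):

```scala
import org.apache.spark.sql.functions.col

val filtered = session
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "stack_overflow_test_table", "keyspace" -> "data_storage"))
  .load()
  .where(col("id") === 4 && col("text_id") === "2")

// The physical plan should list the pushed predicates,
// e.g. an entry like PushedFilters: [EqualTo(id,4), EqualTo(text_id,2)].
filtered.explain(true)
```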
Therefore I have two questions:
- Does this affect performance?
- Is there a workaround?