
Given the following Cassandra table:

CREATE TABLE data_storage.stack_overflow_test_table (
    id int,
    text_id text,
    clustering date,
    some_other text,
    PRIMARY KEY (( id, text_id ), clustering)
)

The following query is valid:

select * from data_storage.stack_overflow_test_table where id=4 and text_id='2';

It is valid because it restricts every column of the partition key.

Consider the following code:

import org.apache.spark.sql.functions.col

val ds = session
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "stack_overflow_test_table", "keyspace" -> "data_storage"))
  .load()
  .where(col("id") === 4 && col("text_id") === "2")

// show() returns Unit, so call it separately to keep ds a DataFrame
ds.show(10)

Since the spark-cassandra-connector pushes predicates down to Cassandra, I expect the query that Spark sends to Cassandra to be something like

SELECT "id", "text_id", "clustering", "some_other" FROM "data_storage"."stack_overflow_test_table" WHERE "id" = ? AND "text_id" = ? 

However, this is what I see in the logs:

18/04/09 15:38:09 TRACE Connection: Connection[localhost/127.0.0.1:9042-2, inFlight=1, closed=false], stream 256, writing request PREPARE SELECT "id", "text_id", "clustering", "some_other" FROM "data_storage"."stack_overflow_test_table" WHERE "id" = ? AND "text_id" = ? ALLOW FILTERING

That means the spark-cassandra-connector adds ALLOW FILTERING to the query.

Therefore I have two questions:

  1. Does this affect performance?
  2. Is there a workaround?

1 Answer


The Spark Cassandra connector's documentation states that ALLOW FILTERING is added implicitly. Note that it also warns that Cassandra itself does not accept every predicate.

  1. "Does this affect performance?"

    The documentation says:

    Note: Although the ALLOW FILTERING clause is implicitly added to the generated CQL query, not all predicates are currently allowed by the Cassandra engine. This limitation is going to be addressed in the future Cassandra releases. Currently, ALLOW FILTERING works well with columns indexed by clustering columns.

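    I read this as saying that the implicit ALLOW FILTERING does not, by itself, affect performance. Your predicate restricts the whole partition key, so Cassandra still reads a single partition; ALLOW FILTERING only lifts the rejection of queries that would otherwise require filtering.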

  2. "Is there a workaround?"

    Do you mean a workaround to make the query faster, or one to stop ALLOW FILTERING from being sent? Either way, there should be no need for one. Send a predicate that Cassandra can execute efficiently, as you already do, and the database engine will pick the best execution plan.
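
If you want to confirm that both predicates from the question are handed to the data source, one option is to print the physical plan with explain(). The following is only a minimal sketch, not something taken from the connector documentation. It assumes a Cassandra node on 127.0.0.1, the connector on the classpath, and a master supplied by spark-submit. The object name PushdownCheck and the application name are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical driver object; the name is an assumption for this sketch.
object PushdownCheck {
  def main(args: Array[String]): Unit = {
    // Assumes Cassandra listens on localhost, as in the log line from the question.
    val session = SparkSession.builder()
      .appName("pushdown-check")
      .config("spark.cassandra.connection.host", "127.0.0.1")
      .getOrCreate()

    // Same read and predicates as in the question.
    val ds = session
      .read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "stack_overflow_test_table", "keyspace" -> "data_storage"))
      .load()
      .where(col("id") === 4 && col("text_id") === "2")

    // Prints the physical plan; the scan node lists the filters
    // that Spark passed down to the data source.
    ds.explain()

    session.stop()
  }
}
```

If both id and text_id appear among the pushed filters of the scan node, the partition-key restriction is reaching Cassandra, and the trailing ALLOW FILTERING in the prepared statement is nothing to worry about.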