
I have a Cassandra table with the following structure:

CREATE TABLE table (
    key int,
    time timestamp,
    measure float,
    PRIMARY KEY (key, time)
);

I need to create a Spark job which will read data from the table above within a specified start and end timestamp, do some processing, and flush the results back to Cassandra.

So the spark-cassandra-connector will have to do a range query on a clustering column of the Cassandra table.

Are there any performance differences if I do:

sc.cassandraTable(keyspace, table).
  as(caseClassObject).
  filter(a => a.time.after(startTime) && a.time.before(endTime))...

so what I am doing is loading all the data into Spark and applying the filtering there.
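
For concreteness, a fuller sketch of this first approach might look as follows (the Measurement case class, connection host, and time bounds are my assumptions, not part of the question):

import java.util.Date
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._  // adds cassandraTable to SparkContext

// Assumed case class mirroring the table schema
case class Measurement(key: Int, time: Date, measure: Float)

val conf = new SparkConf()
  .setAppName("RangeQueryJob")
  .set("spark.cassandra.connection.host", "127.0.0.1")  // assumed host
val sc = new SparkContext(conf)

val startTime = new Date(1420070400000L)  // assumed bounds (epoch millis)
val endTime   = new Date(1420156800000L)

// Option 1: pull the whole table into Spark, then filter there
val inRange = sc.cassandraTable[Measurement]("keyspace", "table")
  .filter(m => m.time.after(startTime) && m.time.before(endTime))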

OR if I do this:

sc.cassandraTable(keyspace, table).
  where(s"time > $startTime AND time < $endTime")...

which filters the data in Cassandra first and then loads only the smaller subset into Spark.
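
If you go this route, note that interpolating a timestamp value directly into the CQL string can easily produce invalid CQL; the connector's where overload with ? bind markers sidesteps that. A sketch, reusing the assumed Measurement class and bounds from above:

// Option 2: push the predicate down to Cassandra with bind markers
val inRangePushed = sc.cassandraTable[Measurement]("keyspace", "table")
  .where("time > ? AND time < ?", startTime, endTime)

As far as I know the connector appends ALLOW FILTERING to the queries it generates, which is what makes a clustering-column-only predicate like this legal on the server side.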

The selectivity of the range query will be around 1%. It is impossible to include the partition key in the query.

Which of these two solutions is preferred?


1 Answer

sc.cassandraTable(keyspace, table).where(s"time > $startTime AND time < $endTime")

Will be MUCH faster. With the first command you do the work of a full table scan just to end up with the same data; with the where clause you do only a fraction of that work (if you only pull 5% of the data, you do roughly 5% of the total work).

In the first case you are

  1. Reading all of the data from Cassandra.
  2. Serializing every object and then moving it to Spark.
  3. Then finally filtering everything.

In the second case you are

  1. Reading only the data you actually want from C*.
  2. Serializing only this tiny subset.
  3. There is no step 3.

As an additional comment, you can also put your case class type right in the call:

sc.cassandraTable[CaseClassObject](keyspace, table)
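
Putting the pieces together with the write-back the question mentions, an end-to-end sketch could look like this (the averaging step and the result table are purely illustrative assumptions):

import com.datastax.spark.connector._

// Assumed output schema; avgMeasure maps to an avg_measure column by default
case class Result(key: Int, avgMeasure: Float)

val results = sc.cassandraTable[Measurement]("keyspace", "table")
  .where("time > ? AND time < ?", startTime, endTime)
  .map(m => (m.key, (m.measure, 1)))
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .map { case (k, (sum, count)) => Result(k, sum / count) }

// Assumed result table: CREATE TABLE result_table (key int PRIMARY KEY, avg_measure float)
results.saveToCassandra("keyspace", "result_table")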