
I have a Cassandra table with a composite partition key of (time_bucket timestamp, node int). time_bucket is the insertion time rounded down to the minute (seconds set to 00), and node values range from 0 to 100.
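The table looks roughly like this (names simplified; non-key columns omitted, and the node column is the one referenced as nodeid_bucket in the queries below):

CREATE TABLE my_keyspace.my_table (
    time_bucket timestamp,
    nodeid_bucket int,
    -- non-key columns omitted
    PRIMARY KEY ((time_bucket, nodeid_bucket))
);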

A Spark job runs every minute, picking up data from the table. The table contains close to 25 million records, with new records being added every minute.

If my Spark job selects all the records every time it runs, the job completes in 2 minutes. But if I query using:

sc.cassandraTable(keyspace_name, table_name).where("time_bucket = ?", from).where("nodeid_bucket IN ?", nodeid_bucket_range)

where nodeid_bucket_range is defined as 0 to 100,

the job takes 10 minutes to complete.

My cluster has 6 nodes and I am using DSE 4.8.9. Each executor uses 8 cores and 20 GB of memory. Increasing these values does not make the Spark job any faster.

Any idea why my job takes 10 minutes? Does the Spark Cassandra connector not function well with IN clauses?


2 Answers

1 vote

You probably want joinWithCassandraTable. An IN clause with a large number of values is almost always better served by a join, which executes all of your requests in parallel on different executors.

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#using-joinwithcassandratable
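A minimal sketch of that approach for this case, assuming the keyspace_name, table_name and from values from the question, and that the partition key columns are time_bucket and nodeid_bucket:

import com.datastax.spark.connector._

// Build an RDD with one entry per partition key to fetch, then join it
// against the table; each Spark partition issues its lookups in parallel
// instead of sending one big IN query.
val keys = sc.parallelize(0 to 100).map(node => (from, node))
val rows = keys
  .joinWithCassandraTable(keyspace_name, table_name)
  .on(SomeColumns("time_bucket", "nodeid_bucket"))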

0 votes

IN statements translate into key1 OR key2 OR key3 ... OR key100, which is very inefficient for the optimizer to do anything useful with. In your case you could use:

sc.cassandraTable(keyspace_name, table_name).where("time_bucket = ?", from).where("nodeid_bucket > ? AND nodeid_bucket < ?", -1, 101)

Watch the edges of the range (> and < are exclusive, hence the -1 and 101), and of course this assumes your range is contiguous.
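If you would rather keep the bounds inclusive, >= and <= avoid the off-by-one at the edges (a sketch using the same names as above):

// Inclusive bounds: selects nodeid_bucket values 0 through 100.
sc.cassandraTable(keyspace_name, table_name)
  .where("time_bucket = ?", from)
  .where("nodeid_bucket >= ? AND nodeid_bucket <= ?", 0, 100)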