
I have a Cassandra table with a composite partition key of (time_bucket timestamp, node int). time_bucket is the insertion time rounded down to the minute (seconds set to 00), and node values range from 0 to 100.
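The table looks roughly like this (names simplified; non-key columns omitted, and the node column is the one referenced as nodeid_bucket in the queries below):

CREATE TABLE my_keyspace.my_table (
    time_bucket timestamp,
    nodeid_bucket int,
    -- non-key columns omitted
    PRIMARY KEY ((time_bucket, nodeid_bucket))
);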

A Spark job runs every minute, picking up data from the table. The table contains close to 25 million records, with new records being added every minute.

If my Spark job selects all the records every time it runs, the job completes in 2 minutes. But if I query using:

sc.cassandraTable(keyspace_name, table_name).where("time_bucket = ?", from).where("nodeid_bucket IN ?", nodeid_bucket_range)

where nodeid_bucket_range is defined as 0 to 100,

the job takes 10 minutes to complete.

My cluster has 6 nodes and I am using DSE 4.8.9. Each executor uses 8 cores and 20 GB of memory. Increasing these values does not make the Spark job any faster.

Any idea why my job takes 10 minutes? Does the Spark Cassandra connector not function well with IN clauses?


2 Answers

1 vote

You probably want joinWithCassandraTable. An IN clause with a large number of values is almost always better served by a join, which executes all of your requests in parallel on different executors.

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#using-joinwithcassandratable
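A minimal sketch of that approach for this case, assuming the keyspace_name, table_name and from values from the question, and that the partition key columns are time_bucket and nodeid_bucket:

import com.datastax.spark.connector._

// Build an RDD with one entry per partition key to fetch, then join it
// against the table; each Spark partition issues its lookups in parallel
// instead of sending one big IN query.
val keys = sc.parallelize(0 to 100).map(node => (from, node))
val rows = keys
  .joinWithCassandraTable(keyspace_name, table_name)
  .on(SomeColumns("time_bucket", "nodeid_bucket"))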

0 votes

IN statements translate into key1 OR key2 OR key3 ... OR key100, which is very inefficient for the optimizer to do anything useful with. In your case you could use:

sc.cassandraTable(keyspace_name, table_name).where("time_bucket = ?", from).where("nodeid_bucket > ? AND nodeid_bucket < ?", -1, 101)

Watch the edges of the range (> and < are exclusive, hence the -1 and 101), and of course this assumes your range is contiguous.
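If you would rather keep the bounds inclusive, >= and <= avoid the off-by-one at the edges (a sketch using the same names as above):

// Inclusive bounds: selects nodeid_bucket values 0 through 100.
sc.cassandraTable(keyspace_name, table_name)
  .where("time_bucket = ?", from)
  .where("nodeid_bucket >= ? AND nodeid_bucket <= ?", 0, 100)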