I'm currently using this stack:
- Cassandra 2.2 (multinode)
- Spark/Streaming 1.4.1
- Spark-Cassandra-Connector 1.4.0-M3
I've got a DStream[Ids] whose RDDs contain around 6,000-7,000 elements each; id is the partition key.
import com.datastax.spark.connector._            // SomeColumns
import com.datastax.spark.connector.streaming._  // joinWithCassandraTable on DStreams
val ids: DStream[Ids] = ...
ids.joinWithCassandraTable(keyspace, tableName, joinColumns = SomeColumns("id"))
As tableName grows, say to around 30k rows, the query takes much longer and I'm having trouble staying under the batch duration. It behaves much like a massive IN clause, which I understand is not advisable.
Are there more effective ways to do this?
Answer:
Always repartition your local RDDs with repartitionByCassandraReplica before joining with a Cassandra table, so that each Spark partition queries only the Cassandra node that owns its data. In my case I also had to raise the number of partitions on the joining local RDD/DStream so the tasks would spread evenly across the workers. A sketch of the pattern follows below.
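A minimal sketch, assuming Ids is a case class with a single id field matching the table's partition key, that keyspace and tableName are in scope as in the question, and that partitionsPerHost = 32 is just an illustrative value (the connector's default is 10):

import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._

case class Ids(id: String) // assumed shape: one field named after the partition key

val joined = ids.transform { rdd =>
  rdd
    // Co-locate each element with the Cassandra replica that owns its
    // partition key, so the join below only hits the local node.
    // Raising partitionsPerHost helps spread tasks across workers.
    .repartitionByCassandraReplica(keyspace, tableName, partitionsPerHost = 32)
    .joinWithCassandraTable(keyspace, tableName, joinColumns = SomeColumns("id"))
}

transform is used here so the RDD-level repartitionByCassandraReplica is applied to every micro-batch; depending on the connector version it may also be available directly on the DStream.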