4 votes

I'm currently using this stack:

  • Cassandra 2.2 (multinode)
  • Spark/Streaming 1.4.1
  • Spark-Cassandra-Connector 1.4.0-M3

I've got a DStream[Ids] whose RDDs contain around 6,000-7,000 elements each. id is the partition key.

import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._

val ids: DStream[Ids] = ...
ids.joinWithCassandraTable(keyspace, tableName, joinColumns = SomeColumns("id"))

As tableName grows, say to around 30k rows, the query takes much longer and I'm having trouble staying under the batch duration threshold. It performs about like a massive IN clause, which I understand is not advisable.

Are there more effective ways to do this?

Answer: Always repartition your local RDDs with repartitionByCassandraReplica before joining with Cassandra, so that each partition only works against its local Cassandra node. In my case I also had to increase the number of partitions on the joining local RDD/DStream so that the tasks spread evenly across the workers.
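
For completeness, here's roughly what that looks like on the DStream. This is a sketch, not my exact code: the Ids case class shape and partitionsPerHost value are illustrative, and the class's fields must mirror the table's partition key.

import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._
import org.apache.spark.streaming.dstream.DStream

// Hypothetical key class: its fields must match the table's partition key.
case class Ids(id: String)

def localJoin(ids: DStream[Ids], keyspace: String, tableName: String) =
  ids.transform { rdd =>
    rdd
      // Shuffle each id to a Spark partition co-located with the Cassandra
      // replica that owns it; partitionsPerHost is the knob I raised to get
      // more, smaller tasks spread across the workers.
      .repartitionByCassandraReplica(keyspace, tableName, partitionsPerHost = 10)
      // Each task now only queries its local node, keyed on the partition key.
      .joinWithCassandraTable(keyspace, tableName, joinColumns = SomeColumns("id"))
  }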


1 Answer

3 votes

Is "id" a partition key in your table? If not, I think it needs to be, otherwise you may be doing a table scan which would run progressively slower as the table gets bigger.

Also, to get good performance with this method, I believe you need to use the repartitionByCassandraReplica() operation on your ids RDD so that the join is a local operation on each node.

See this.