Cassandra Reading Benchmark with Spark

Question

I'm benchmarking Cassandra's read performance. For the test setup I created clusters with 1 / 2 / 4 EC2 instances (data nodes). I wrote one table with 100 million entries (~3 GB CSV file). Then I launch a Spark application that reads the data into an RDD using the spark-cassandra-connector.

I expected the following behavior: the more instances Cassandra uses (with the same number of instances on the Spark side), the faster the reads. The writes behave as expected (about 2 times faster when the cluster is 2 times larger).

But in my benchmark the read is always faster with a 1-instance cluster than with a 2- or 4-instance cluster.

My Benchmark Results:

Cluster-size 4: Write: 1750 seconds / Read: 360 seconds

Cluster-size 2: Write: 3446 seconds / Read: 420 seconds

Cluster-size 1: Write: 7595 seconds / Read: 284 seconds
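
To make the scaling easier to see, the timings above can be turned into speedups relative to the 1-node cluster. This is only a small arithmetic sketch; the names `results` and `speedup` are made up for illustration, and the numbers are copied from the results listed above.

```python
# Timings in seconds, copied from the benchmark results above.
results = {
    1: {"write": 7595, "read": 284},
    2: {"write": 3446, "read": 420},
    4: {"write": 1750, "read": 360},
}


def speedup(results, op, baseline=1):
    """Return {cluster_size: baseline_time / time} for one operation."""
    base = results[baseline][op]
    return {size: base / t[op] for size, t in sorted(results.items())}


write_speedup = speedup(results, "write")
read_speedup = speedup(results, "read")

for size in sorted(results):
    print(f"{size} node(s): write x{write_speedup[size]:.2f}, "
          f"read x{read_speedup[size]:.2f}")
```

A value above 1 means faster than the single node. The writes come out at roughly 2.2x and 4.3x, while the reads stay below 1 for both larger clusters.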

ADDITIONAL TRY - WITH THE CASSANDRA-STRESS TOOL

I launched the "cassandra-stress" tool on the Cassandra cluster (size 1 / 2 / 3 / 4 nodes), with the following results:

Cluster size   Threads     Ops/sec  Time (s)
1              4           10146    30,1
               8           15612    30,1
              16           20037    30,2
              24           24483    30,2
             121           43403    30,5
             913           50933    31,7
2              4            8588    30,1
               8           15849    30,1
              16           24221    30,2
              24           29031    30,2
             121           59151    30,5
             913           73342    31,8
3              4            7984    30,1
               8           15263    30,1
              16           25649    30,2
              24           31110    30,2
             121           58739    30,6
             913           75867    31,8
4              4            7463    30,1
               8           14515    30,1
              16           25783    30,3
              24           31128    31,1
             121           62663    30,9
             913           80656    32,4

Results: With 4 or 8 threads the single-node cluster is roughly as fast as, or faster than, the larger clusters.
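
The same comparison can be done per thread count by dividing each cluster's ops/sec by the single-node value. This is a minimal sketch using the numbers from the table above; `ops_per_sec` and `relative_throughput` are illustrative names only.

```python
# Ops/sec from the cassandra-stress table above, one list per cluster size,
# in the same order as the thread counts.
threads = [4, 8, 16, 24, 121, 913]
ops_per_sec = {
    1: [10146, 15612, 20037, 24483, 43403, 50933],
    2: [8588, 15849, 24221, 29031, 59151, 73342],
    3: [7984, 15263, 25649, 31110, 58739, 75867],
    4: [7463, 14515, 25783, 31128, 62663, 80656],
}


def relative_throughput(ops, baseline=1):
    """Throughput of each cluster size divided by the baseline cluster's."""
    base = ops[baseline]
    return {size: [v / b for v, b in zip(vals, base)]
            for size, vals in ops.items()}


rel = relative_throughput(ops_per_sec)

for size in sorted(rel):
    row = ", ".join(f"{t} thr: {r:.2f}" for t, r in zip(threads, rel[size]))
    print(f"{size} node(s): {row}")
```

Ratios below 1 mean the cluster is slower than the single node at that thread count. With 4 threads every larger cluster is below 1, and the larger clusters only pull clearly ahead at the higher thread counts.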

Results as a diagram (image not included): the data series are the cluster sizes (1/2/3/4), the x-axis shows the threads, and the y-axis the ops/sec.

Question: Are these cluster-wide results, or is this a test of a single local node (and therefore the result of only one instance of the ring)?

Can someone give an explanation? Thank you!

Are you running a spark worker instance on each Cassandra node? — Jim Meyer
No, I've set up the clusters separately from each other. So I have a Spark cluster with 1 / 2 / 4 worker nodes and a Cassandra cluster with 1 / 2 / 4 nodes. — D. Müller

Jim Meyer · Accepted Answer · 2015-07-10T20:22:46

I ran a similar test with a spark worker running on each Cassandra node.

Using a Cassandra table with 15 million rows (about 1.75 GB of data), I ran a spark job to create an RDD from the table with each row as a string, and then printed a count of the number of rows.

Here are the times I got:

1 C* node, 1 spark worker - 1 min. 42 seconds
2 C* nodes, 2 spark workers - 55 seconds
4 C* nodes, 4 spark workers - 35 seconds

So it seems to scale pretty well with the number of nodes when the spark workers are co-located with the C* nodes.
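
For reference, here is the arithmetic behind "scales pretty well", as a small sketch based on the three timings above and the 15 million row count. The variable names are just for illustration.

```python
ROWS = 15_000_000  # table size used in this test

# Wall-clock times in seconds (1 min 42 s = 102 s), keyed by node count.
timings_s = {1: 102, 2: 55, 4: 35}

# Speedup relative to the single-node run.
speedups = {n: timings_s[1] / t for n, t in timings_s.items()}
# Parallel efficiency: 1.0 would be perfectly linear scaling.
efficiency = {n: speedups[n] / n for n in timings_s}
# Rows counted per second for each setup.
rows_per_s = {n: ROWS / t for n, t in timings_s.items()}

for n in sorted(timings_s):
    print(f"{n} node(s): speedup {speedups[n]:.2f}, "
          f"efficiency {efficiency[n]:.2f}, "
          f"{rows_per_s[n]:,.0f} rows/s")
```

That works out to a speedup of about 1.85 on 2 nodes and about 2.91 on 4 nodes, so not perfectly linear, but the trend goes in the expected direction.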

By not co-locating your workers with Cassandra, you are forcing all the table data to go across the network. That will be slow and may be the bottleneck in your environment. If you co-locate them, you benefit from data locality, since Spark will create the RDD partitions from the tokens that are local to each machine.
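
As a rough illustration of the locality argument, here is a toy model (not the connector's actual scheduling logic). It assumes the data is spread evenly over the Cassandra hosts and that a token range is read locally whenever a Spark worker runs on the same host; it ignores replication and coordinator hops. The function name and host names are hypothetical.

```python
def network_read_fraction(cassandra_hosts, worker_hosts):
    """Fraction of the table that has to cross the network in the toy model.

    Assumption: data is evenly distributed over cassandra_hosts, and a host's
    data is read locally only if a Spark worker runs on that same host.
    """
    if not cassandra_hosts:
        raise ValueError("need at least one Cassandra host")
    workers = set(worker_hosts)
    remote = [h for h in cassandra_hosts if h not in workers]
    return len(remote) / len(cassandra_hosts)


# Workers on the same machines as the Cassandra nodes.
colocated = network_read_fraction(["c1", "c2", "c3", "c4"],
                                  ["c1", "c2", "c3", "c4"])
# Separate Spark and Cassandra clusters, as in the question.
separate = network_read_fraction(["c1", "c2", "c3", "c4"],
                                 ["s1", "s2", "s3", "s4"])

print(f"co-located: {colocated:.0%} of the data crosses the network")
print(f"separate:   {separate:.0%} of the data crosses the network")
```

In this model the co-located setup reads everything locally, while the separate clusters pull the whole table over the network no matter how many nodes are added.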

You may also have some other bottleneck. I'm not familiar with EC2 and what it offers. Hopefully it has local disk storage rather than network storage since C* doesn't like network storage.
