
Spark version: 1.4.0, Cassandra version: 2.1.8

I am using the DataStax Spark Cassandra Connector to connect Spark and Cassandra. I have a 6-node Spark cluster with 6 workers, backed by 2 Cassandra nodes.

I tried a sample application that counts the rows in a column family (CassandraJavaUtil.javaFunctions(sc).cassandraTable("keyspace", "columnfamily").count()).

When I submitted this single job to the master, it ran on 2 worker nodes in the Spark cluster (according to the Event Timeline).


  1. I dispatched a single job. Why was it run by two workers? Does one worker act as a master here?
  2. I found the deserialisation time to be very high on one worker, while the other worker completed its part quickly (one took 40 seconds and the other took 1 second). Can you shed some light on this?
  3. Both workers seem to have established a connection with Cassandra and returned a result. So, in my view, both are doing the same job. Can you shed some light on this?
  4. I am still wondering how RDDs fit into this distributed setup with Cassandra. How do multiple workers know which Cassandra partitions they have to work on if, say, 10k partitions are split among 6 workers? Is all the fetching done by one worker and the processing done by all 6? Even in that case, the execution logic remains the same on all workers (fetch from Cassandra and process). How does Spark do this?
  5. I would like to know the real advantage of using Spark with Cassandra. Is it at the memory management level, or does it have other advantages?


[Image: screenshot of the run]

I have added a picture of the run. I have just 10 partitions, and this is a simple count operation.

My question still remains a puzzle, I guess.

If you look at the attachment, you will get the idea. This was a single job submitted to my Spark master, and I am wondering why it runs in two different executors. Both executors return the same number of bytes, which suggests that both fetched all 10 partitions from Cassandra. If that is how it works, what does Spark give me over Cassandra alone? Or do I have to fetch the data some other way so that the ten partitions are divided between the two workers?


1 Answer


I recommend that you spend a few hours reading up on Spark and C*. I have some recommended material that I've picked out at the bottom of this post.

Let me take on your questions for now:

I dispatched a single job. Why was it run by two workers? Does one worker act as a master here?

This probably has to do with either resource availability or the number of partitions in your job (more likely the latter).

As Russ puts it "Increase the parallelism of your job. Try increasing the number of partitions in your job. By splitting the work into smaller sets of data less information will have to be resident in memory at a given time. For a Spark Cassandra Connector job this would mean decreasing the split size variable."

To tune this in 1.2 use:

spark.cassandra.input.split.size
spark.cassandra.output.batch.size.rows
spark.cassandra.output.batch.size.bytes

In the newer versions, you also have: spark.cassandra.output.throughput_mb_per_sec
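
As a rough illustration of why the split size matters, the arithmetic below models the number of Spark partitions (and therefore tasks) as the number of Cassandra partitions divided by the split size. This is a simplified model in plain Python, assuming the setting is read as an approximate count of Cassandra partitions per Spark partition; it is not the connector's actual splitter, and the function name and table size are made up for illustration.

```python
import math

def estimated_spark_partitions(cassandra_partitions: int, split_size: int) -> int:
    """Simplified model: one Spark partition per `split_size` Cassandra partitions."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    # At least one Spark partition is always needed to run a task.
    return max(1, math.ceil(cassandra_partitions / split_size))

# Hypothetical table with 1,000,000 Cassandra partitions.
for split_size in (100_000, 10_000, 1_000):
    n = estimated_spark_partitions(1_000_000, split_size)
    print(f"split size {split_size:>7}: about {n} Spark partitions (tasks)")
```

Under this model, a smaller split size yields more, smaller tasks, which is what gives more executors something to do.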

I found the deserialisation time to be very high on one worker, while the other worker completed its part quickly (one took 40 seconds and the other took 1 second). Can you shed some light on this?

From Kay, who actually added the feature to the web UI:

"Time to deserialize the task can be large relative to task time for short jobs, and understanding when it is high can help developers realize that they should try to reduce closure size (e.g, by including less data in the task description)."
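
To make the point about closure size concrete, here is a small stand-in in plain Python. Spark serializes JVM closures itself; this sketch only uses the standard pickle module to show how data captured in a task description inflates what has to be shipped and deserialized. The task dictionaries and the lookup table are hypothetical.

```python
import pickle

# Hypothetical lookup table that a task might capture by accident.
lookup = {i: str(i) for i in range(200_000)}

# A lean task description versus one that drags the whole table along.
small_task = {"op": "count", "captured": None}
large_task = {"op": "count", "captured": lookup}

small_bytes = len(pickle.dumps(small_task))
large_bytes = len(pickle.dumps(large_task))

print(f"lean task description:  {small_bytes} bytes")
print(f"task with captured map: {large_bytes} bytes")
```

The second payload is orders of magnitude larger, and every task that carries it pays the cost of receiving and deserializing it before doing any real work.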

Both workers seem to have established a connection with Cassandra and returned a result. So, in my view, both are doing the same job. Can you shed some light on this?

Spark works in parallel. Because this is a distributed computing paradigm, you take advantage of multiple nodes and multiple cores by kicking off executors that do work in parallel. Both executors will pull data from C*, but they'll pull different data based on partitioning.

See some of the intro videos for details.

I am still wondering how RDDs fit into this distributed setup with Cassandra. How do multiple workers know which Cassandra partitions they have to work on if, say, 10k partitions are split among 6 workers? Is all the fetching done by one worker and the processing done by all 6? Even in that case, the execution logic remains the same on all workers (fetch from Cassandra and process). How does Spark do this?

Each worker fetches and processes its own data based on partitioning.
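
A toy model in plain Python may help picture this. It cuts the Murmur3 token ring into contiguous, non-overlapping ranges and hands them out to executors, so each executor reads only its own ranges (in the connector, each task restricts its query to a token range). The round-robin assignment, the function names, and the executor names are assumptions for illustration; the real scheduler places tasks by locality and availability.

```python
from collections import defaultdict

# Token range of Cassandra's Murmur3Partitioner.
MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1

def split_token_ring(num_splits):
    """Cut the full token ring into contiguous, non-overlapping inclusive ranges."""
    span = MAX_TOKEN - MIN_TOKEN + 1
    bounds = [MIN_TOKEN + span * i // num_splits for i in range(num_splits + 1)]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(num_splits)]

def assign_round_robin(splits, executors):
    """Hand each split to exactly one executor (illustrative policy only)."""
    assignment = defaultdict(list)
    for i, split in enumerate(splits):
        assignment[executors[i % len(executors)]].append(split)
    return dict(assignment)

splits = split_token_ring(10)
assignment = assign_round_robin(splits, ["executor-1", "executor-2"])

for executor, ranges in assignment.items():
    print(executor, "reads", len(ranges), "token ranges, starting at", ranges[0][0])
```

No range appears under two executors, so two executors returning results does not mean they read the same rows.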

To get info on how your job will be partitioned use:


If you're colocating Spark and Cassandra, as is the case in DSE, you get the advantage of data locality (no need to stream data from C* to the Spark workers over the network).

I would like to know the real advantage of using Spark with Cassandra. Is it at the memory management level, or does it have other advantages?

There are probably too many to list here; see the recommended reading/viewing. The big hitters are SQL-style queries (joins, aggregations, group-bys, etc.) for batch and streaming analytics, plus statistical modeling with MLlib, graph analytics with GraphX, and so on.

Here is some good material that should get you up to speed:

This is a high-level presentation from Russ on what's possible with Spark and C*: http://www.slideshare.net/planetcassandra/escape-from-hadoop

O'Reilly webinar with Sameer from Databricks on how DSE integrates with Spark: http://www.oreilly.com/pub/e/3234

How the connector reads data: https://academy.datastax.com/demos/how-spark-cassandra-connector-reads-data

Critical posts on troubleshooting Spark, which will be helpful once you're actually trying to get things to work. These will answer most of your ops/perf questions: http://www.datastax.com/dev/blog/common-spark-troubleshooting

Two similar and also valuable posts from Sandy (not C*-specific):
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/