4 votes

I am making a small Spark application using the Spark Cassandra connector and dataframes in Python, but I am getting extremely low write speeds. When I look at the application logs, it says:

17/03/28 20:04:05 INFO TableWriter: Wrote 315514 rows to movies.moviescores in 662.134 s.    

Which is approximately 474 rows per second.

I am reading some data from Cassandra into a dataframe, then I do some operations on it (which also make the set a lot larger), and then I write the result back to Cassandra (approximately 50 million rows):

result.write.format("org.apache.spark.sql.cassandra").mode('append').options(table="moviescores", keyspace="movies").save()

Where result is a dataframe.
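For context, the read side looks roughly like this (a sketch, not my exact code; the "ratings" source table name is an assumption, and the Cassandra contact point is a placeholder):

from pyspark.sql import SparkSession

# Assumed session setup; "<cassandra-host>" is a placeholder contact point.
spark = SparkSession.builder \
    .appName("moviescores") \
    .config("spark.cassandra.connection.host", "<cassandra-host>") \
    .getOrCreate()

# Read the source data from Cassandra (the "ratings" table name is assumed).
ratings = spark.read \
    .format("org.apache.spark.sql.cassandra") \
    .options(table="ratings", keyspace="movies") \
    .load()

# ... transformations that grow the set to roughly 50 million rows ...
result = ratings  # placeholder for the actual transformation pipeline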

Here is the creation of my keyspace, if it matters:

CREATE KEYSPACE IF NOT EXISTS movies WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3 };

And the table I am writing to:

CREATE TABLE IF NOT EXISTS movieScores(movieId1 int, movieId2 int, score int, PRIMARY KEY((movieId1, movieId2)));

My setup is as follows: 5 Spark workers, each running in a Docker container on a different CoreOS node with 2 GB of RAM and 2 cores at DigitalOcean, and 3 Cassandra nodes, each running in a Docker container on a different CoreOS node with 2 GB of RAM and 2 cores at DigitalOcean.

The nodes running Spark have 2 GB of RAM, but they can only use up to 1 GB, as this is Spark's default setting for standalone mode:

(default: your machine's total RAM minus 1 GB)

Not sure if it's wise to raise this.
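(If I did raise it, I believe the relevant knobs are SPARK_WORKER_MEMORY in conf/spark-env.sh on each worker and spark.executor.memory on the application side, something like the sketch below; the 1536m value is just an untested guess for a 2 GB box, not a recommendation:)

from pyspark.sql import SparkSession

# Sketch only: spark.executor.memory caps each executor's heap.
# "1536m" is an untested guess for a 2 GB machine.
spark = SparkSession.builder \
    .appName("moviescores") \
    .config("spark.executor.memory", "1536m") \
    .getOrCreate()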

Now I have read that I should run a Spark worker and a Cassandra node on each node in my DigitalOcean cluster. But I am not sure if it's a good idea to run a Docker container with Spark and another container with a Cassandra node on a 2 GB machine with only 2 cores.

Why is it writing so slowly? Are there any parameters/settings that I should change/set in order to increase the write speed? Perhaps my setup is all wrong? I am quite new to Spark and Cassandra.

Update: I just did a test on the same table without Spark, using just the Cassandra connector for Python and a small Python program on my laptop. I used batch inserts with batches of 1000 rows and I could insert 1 million rows in just 35 seconds, which is almost 30000 rows per second, way faster. So perhaps Spark is the issue, rather than Cassandra. Perhaps it would make sense to put the rest of my code here? Or perhaps something is wrong with my setup?
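Roughly what that test looked like (a sketch from memory, using the DataStax Python driver; the host and the generated row values are placeholders):

from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

cluster = Cluster(['<cassandra-host>'])      # placeholder contact point
session = cluster.connect('movies')

insert = session.prepare(
    "INSERT INTO moviescores (movieid1, movieid2, score) VALUES (?, ?, ?)")

batch = BatchStatement()
count = 0
for i in range(1000000):
    batch.add(insert, (i, i + 1, i % 100))   # placeholder row values
    count += 1
    if count == 1000:                        # send in batches of 1000 rows
        session.execute(batch)
        batch = BatchStatement()
        count = 0
if count:
    session.execute(batch)                   # flush the last partial batch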

This might help you. Though this is an old answer, I'll try to update it or write a new answer with a few additional tips I came across recently. – Nachiket Kate
Thanks, but I just did a test on the same table without Spark, using just the Cassandra connector for Python and a small Python program on my laptop. I used batch inserts with batches of 1000 rows and I could insert 1 million rows in just 35 seconds, which is almost 30000 rows per second, way faster. So perhaps Spark is the issue, rather than Cassandra. – SilverTear
Great. To verify that Spark is the bottleneck, try to measure Spark's throughput. – Nachiket Kate
I'm sorry if this is a silly question, but what is the best way to do this? I am running a Spark standalone cluster. Also, if I go into my dashboard on DigitalOcean, I can see that the Spark nodes have about 90% CPU usage on each node. Not sure if this is OK. – SilverTear
I just ran another test with a local Spark cluster and a single worker, and it was just as slow, so the issue doesn't appear to be the cloud setup. How can it be that I can write directly to Cassandra at 30k records a second, but once I use Spark and the Cassandra connector it becomes lethargic? – SilverTear

1 Answer

0 votes

I recently ran into a similar problem when persisting more than 80 million records to Cassandra. In my case I used the Spark Java API. What helped resolve my issue was applying orderBy() on the Dataset before saving it to Cassandra via the spark-cassandra-connector. Try ordering your dataset first and then calling save().
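In your PySpark case that would look something like the sketch below; sorting by the partition key columns from your schema is my assumption of a sensible ordering, and it assumes your dataframe columns are named after the table columns:

# Sort by the partition key columns before writing (assumed column names).
result.orderBy("movieid1", "movieid2") \
    .write.format("org.apache.spark.sql.cassandra") \
    .mode('append') \
    .options(table="moviescores", keyspace="movies") \
    .save()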