I am making a small Spark application using the Spark Cassandra connector and DataFrames in Python, but I am getting extremely low write speeds. When I look at the application logs it says:
17/03/28 20:04:05 INFO TableWriter: Wrote 315514 rows to movies.moviescores in 662.134 s.
Which is approximately 476 rows per second.
I am reading some data from a Cassandra table into a DataFrame, doing some operations on it (which also make the data set a lot larger), and then writing the result back to Cassandra (approximately 50 million rows):
result.write.format("org.apache.spark.sql.cassandra").mode('append').options(table="moviescores", keyspace="movies").save()
Where result is a dataframe.
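For context, the read side uses the same connector format (assuming an existing SparkSession named spark; the source table name below is just a placeholder, since the exact name does not matter for the question):

# Read the input data from Cassandra; "ratings" is a placeholder table name
source = (
    spark.read.format("org.apache.spark.sql.cassandra")
    .options(table="ratings", keyspace="movies")
    .load()
)
# ... transformations that grow the data to roughly 50 million rows,
# producing the DataFrame called `result` used in the write above ...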
Here is the creation of my keyspace, if it matters:
CREATE KEYSPACE IF NOT EXISTS movies WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3 };
And the table I am writing to:
CREATE TABLE IF NOT EXISTS movieScores(movieId1 int, movieId2 int, score int, PRIMARY KEY((movieId1, movieId2)));
My setup is as follows: 5 Spark workers, each in a Docker container on its own DigitalOcean node running CoreOS with 2 GB of RAM and 2 cores, and 3 Cassandra nodes, each in a Docker container on its own DigitalOcean node with the same specs (2 GB of RAM, 2 cores).
The nodes running Spark have 2 GB of RAM, but they can only use up to 1 GB, as this is Spark's default setting for standalone mode:
(default: your machine's total RAM minus 1 GB)
Not sure if it's wise to raise this.
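If raising it is advisable, my understanding is that the worker-side cap is SPARK_WORKER_MEMORY in conf/spark-env.sh on each worker, and that the application would also have to request more executor memory, roughly like this (the memory value and the Cassandra host below are placeholders, not tested settings):

from pyspark.sql import SparkSession

# Sketch of requesting more executor memory than the current 1 GB cap.
# Values are guesses for a 2 GB node; the worker cap would need raising too.
spark = (
    SparkSession.builder
    .appName("moviescores")                                  # hypothetical app name
    .config("spark.executor.memory", "1536m")                # placeholder value
    .config("spark.cassandra.connection.host", "10.0.0.1")   # placeholder contact point
    .getOrCreate()
)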
Now I have read that I should run a Spark worker and a Cassandra node together on each machine in my DigitalOcean cluster, but I am not sure it's a good idea to run a Docker container with Spark and another container with a Cassandra node on a 2 GB machine with only 2 cores.
Why is the write so slow? Are there parameters/settings that I should change in order to increase write speed? Perhaps my setup is all wrong? I am quite new to Spark and Cassandra.
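For reference, these are the write-related settings I have come across in the Spark Cassandra connector documentation, though I do not know which of them (if any) are sensible to tune here; the numbers below are guesses, not recommendations:

from pyspark.sql import SparkSession

# Candidate connector write-tuning options set on the session; values are guesses.
spark = (
    SparkSession.builder
    .config("spark.cassandra.output.concurrent.writes", "10")      # batches written in parallel per task
    .config("spark.cassandra.output.batch.size.rows", "200")       # rows per batch
    .config("spark.cassandra.output.throughput_mb_per_sec", "50")  # per-core write throttle
    .getOrCreate()
)

# The write call itself would stay the same as above
result.write.format("org.apache.spark.sql.cassandra") \
    .mode("append").options(table="moviescores", keyspace="movies").save()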
Update: I just ran a test against the same table without Spark, using the Python Cassandra driver in a small program on my laptop. With batch inserts of 1000 rows each, I could insert 1 million rows in just 35 seconds, which is almost 30,000 rows per second, way faster. So perhaps Spark is the issue rather than Cassandra. Would it make sense to post the rest of my code here? Or is something wrong with my setup?
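For completeness, the laptop test was roughly the following (simplified; the contact point is a placeholder and the inserted values are dummy data):

from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

# Plain Python driver test: insert 1 million rows in batches of 1000
cluster = Cluster(["127.0.0.1"])   # placeholder contact point
session = cluster.connect("movies")

insert = session.prepare(
    "INSERT INTO moviescores (movieid1, movieid2, score) VALUES (?, ?, ?)"
)

batch = BatchStatement()
pending = 0
for i in range(1000000):
    batch.add(insert, (i, i + 1, i % 100))   # dummy row values
    pending += 1
    if pending == 1000:                      # flush every 1000 rows
        session.execute(batch)
        batch = BatchStatement()
        pending = 0
if pending:
    session.execute(batch)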