I'm using Apache Spark with the Spark Cassandra Connector to write millions of rows to a Cassandra cluster. The replication factor is set to 3, and I set the write consistency to ALL in spark-submit (YARN client mode) with the following options:
spark-submit ...
--conf spark.cassandra.output.consistency.level=ALL \
--conf spark.cassandra.output.concurrent.writes=1 \
--conf spark.cassandra.output.batch.size.bytes=20000 \
...
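For reference, the write side is nothing more than a saveToCassandra call. The sketch below is roughly equivalent (the keyspace, table and record names are placeholders, not my real schema); it also shows the programmatic way of forcing the consistency level through WriteConf instead of --conf, assuming connector 1.1+ and that sc is the SparkContext:

import com.datastax.spark.connector._
import com.datastax.spark.connector.writer.WriteConf
import com.datastax.driver.core.ConsistencyLevel

// placeholder row type and data source, just for illustration
case class Record(id: Long, payload: String)
val rows = sc.parallelize(1L to 1000000L).map(i => Record(i, s"value-$i"))

// same effect as --conf spark.cassandra.output.consistency.level=ALL
rows.saveToCassandra("my_keyspace", "my_table",
  writeConf = WriteConf(consistencyLevel = ConsistencyLevel.ALL))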
I then wrote another Spark job to count the rows I had written, and set the read consistency of that job as follows:
spark-submit ...
--conf spark.cassandra.input.consistency.level=ONE \
--conf spark.cassandra.input.split.size=50000 \
...
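The count job itself is trivial, roughly the sketch below (again with placeholder names; the ReadConf is just the programmatic equivalent of the --conf flags above):

import com.datastax.spark.connector._
import com.datastax.spark.connector.rdd.ReadConf
import com.datastax.driver.core.ConsistencyLevel

// same effect as --conf spark.cassandra.input.consistency.level=ONE
val count = sc.cassandraTable("my_keyspace", "my_table")
  .withReadConf(ReadConf(consistencyLevel = ConsistencyLevel.ONE))
  .count()
println(s"row count = $count")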
According to the documentation, if the write consistency level plus the read consistency level is greater than the replication factor, reads should be consistent.
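With my settings that works out as:

RF = 3
write = ALL -> 3 replicas must acknowledge each write
read  = ONE -> 1 replica answers each read
3 + 1 = 4 > 3, so every read should hit at least one replica that holds the latest write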
But I'm getting the following results:
- The read job returns a different count every time I run it
- If I increase the consistency level of the read job, I get the expected count
What am I missing? Is there some hidden configuration applied by default (e.g. the consistency level being silently lowered when a write runs into trouble, or something along those lines)? Am I using a buggy version of Cassandra (it's 2.1.2)? Or is there a problem with the batch updates that spark-cassandra-connector uses to save data to Cassandra (I'm simply using the "saveToCassandra" method)?
What's going wrong?