3
votes

I'm using Apache Spark with the Spark Cassandra Connector to write millions of rows to a Cassandra cluster. The replication factor is set to 3, and I set the write consistency to ALL in spark-submit (YARN client mode) using the following options:

spark-submit ...
--conf spark.cassandra.output.consistency.level=ALL \
--conf spark.cassandra.output.concurrent.writes=1 \
--conf spark.cassandra.output.batch.size.bytes=20000 \
...
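
The data itself is written with the connector's saveToCassandra. A simplified sketch of that kind of job (the keyspace, table, and column names here are placeholders, not my real schema):

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object WriteJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("write-job"))

    // The consistency level is not set in code; it comes from
    // spark.cassandra.output.consistency.level passed to spark-submit above.
    val rows = sc.parallelize(1 to 1000000).map(i => (i, s"value-$i"))
    rows.saveToCassandra("my_keyspace", "my_table", SomeColumns("id", "value"))
  }
}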

I then wrote another Spark job to count the data I had written. I set the consistency of this new job as follows:

spark-submit ...
--conf spark.cassandra.input.consistency.level=ONE \
--conf spark.cassandra.input.split.size=50000 \
...
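
For reference, a minimal counting job with the connector looks something like this (keyspace/table names are again placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object CountJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("count-job"))

    // Reads pick up spark.cassandra.input.consistency.level from spark-submit above.
    val count = sc.cassandraTable("my_keyspace", "my_table").count()
    println(s"row count = $count")
  }
}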

According to the documentation, if the write consistency level plus the read consistency level is greater than the replication factor, I should get consistent reads.
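
Concretely, with a replication factor of 3, ALL means W = 3 and ONE means R = 1, so W + R = 4 > 3 = RF; every read should therefore contact at least one replica that acknowledged the write.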

But I'm getting the following results:

  • The read job returns a different count every time I run it
  • If I increase the consistency level of the read job, I get the expected count

What am I missing? Is there some default configuration I'm not aware of (e.g. the consistency level being silently lowered if a write runs into trouble), am I using a buggy version of Cassandra (it's 2.1.2), or is there an issue with the batch updates that spark-cassandra-connector uses for saving data to Cassandra (I'm simply using the "saveToCassandra" method)?

What's going wrong?

1
This sounds like a bug to me. Could you report it on the connector's GitHub site? I'm guessing that the output consistency level isn't being applied appropriately. My cursory glance at the code makes me think that it is working, but more details would help (write command, cluster setup, number of nodes, DCs, ...) – RussS

1 Answer

3
votes

I confirm that this is a bug in the connector. The consistency level is being set on the individual prepared statements, but it is simply ignored when batch statements are used. Follow the updates on the connector; the fix is going to be included in the next bug-fix release.
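
To illustrate the mechanism (this is not the connector's internal code, just a plain DataStax Java driver example against a made-up table): when statements are grouped into a batch, the level that matters is the one set on the BatchStatement itself, not on the statements inside it:

import com.datastax.driver.core.{BatchStatement, Cluster, ConsistencyLevel}

// Illustration only; keyspace, table and columns are made up.
val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("my_keyspace")

val prepared = session.prepare("INSERT INTO my_table (id, value) VALUES (?, ?)")

val batch = new BatchStatement()
batch.add(prepared.bind(Int.box(1), "a"))
batch.add(prepared.bind(Int.box(2), "b"))

// The consistency level that actually applies is the one set on the batch;
// setting it only on the individual prepared/bound statements has no effect here.
batch.setConsistencyLevel(ConsistencyLevel.ALL)
session.execute(batch)

cluster.close()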