Recently we were calculating some statistics using the DataStax Spark Cassandra Connector, and repeated queries were returning different results on each execution.
Background: we have approx. 112K records in a 3-node Cassandra cluster. The table has a single partition key column named guid of type UUID and no clustering columns.
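For completeness, here is roughly how the context is set up (the app name and contact point below are placeholders, not our real values):

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._  // adds cassandraTable to SparkContext
import java.util.UUID

val conf = new SparkConf()
  .setAppName("guid-loss-check")                      // placeholder app name
  .set("spark.cassandra.connection.host", "10.0.0.1") // placeholder contact point
val sc = new SparkContext(conf)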
This is the simple guid extractor I defined to examine the losses:

val guids = sc.cassandraTable[UUID]("keyspace", "contracts").select("guid")
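A cheaper sanity check, without materializing full sets, is to count the rows the connector returns over several runs; if the counts fluctuate, rows are being dropped at read time:

// Repeatedly count the rows as seen through the connector.
val counts = List.fill(5)(guids.count())
println(counts.mkString(", "))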
Next, I repeatedly extracted the data into local collections:
val gss = List.fill(20)(guids.collect().toSet) // 20 full snapshots of the table
val gsall = gss.reduce(_ | _)                  // union of everything ever seen
val lost = gss.map(gs => (gsall &~ gs).size)   // rows missing from each snapshot
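Continuing from the values above, one can also pull out a sample of the guids a given snapshot missed, to probe them directly from cqlsh:

// Take a few guids present in the union but absent from the first snapshot.
val missingFromFirst = (gsall &~ gss.head).take(10)
missingFromFirst.foreach(println)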
The resulting lost is

List(5970, 7067, 6926, 6683, 5807, 7901, 7005, 6420, 6911, 6876, 7038, 7914, 6562, 6576, 6937, 7116, 7374, 6836, 7272, 7312)

so we lose 6.17 ± 0.47% of the data on each query.
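(That figure is just the mean and standard deviation of lost, computed roughly like this, taking the total as 112K:)

val total = 112000.0                            // approx. record count
val mean  = lost.sum / lost.size.toDouble
val std   = math.sqrt(lost.map(x => math.pow(x - mean, 2)).sum / lost.size)
println(f"${mean / total * 100}%.2f%% ± ${std / total * 100}%.2f%%")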
Could this be a problem with Cassandra, Spark, or the connector? And in each case, is there some configuration setting to prevent it?
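For reference, the one connector setting I have found that looks relevant is the input consistency level; a sketch of what I would try (whether this is actually the right fix is exactly what I am asking):

import org.apache.spark.SparkConf

// Assumption: raising the connector's read consistency to QUORUM would
// force each read to be acknowledged by a majority of replicas.
val conf = new SparkConf()
  .set("spark.cassandra.input.consistency.level", "QUORUM")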