Recently we were calculating some statistics using the DataStax Spark Cassandra Connector, and repeated queries were returning different results on each execution.
Background: we have approx. 112K records in a 3-node Cassandra cluster. The table has a single partition key column named guid of type UUID and no clustering columns.
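For completeness, here is roughly how the context is set up (the app name and contact point below are placeholders, not our real values):

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._  // adds cassandraTable to SparkContext
import java.util.UUID

val conf = new SparkConf()
  .setAppName("guid-loss-check")                      // placeholder app name
  .set("spark.cassandra.connection.host", "10.0.0.1") // placeholder contact point
val sc = new SparkContext(conf)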
This is the simple guid extractor I defined to examine the losses:

val guids = sc.cassandraTable[UUID]("keyspace", "contracts").select("guid")
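A cheaper sanity check, without materializing full sets, is to count the rows the connector returns over several runs; if the counts fluctuate, rows are being dropped at read time:

// Repeatedly count the rows as seen through the connector.
val counts = List.fill(5)(guids.count())
println(counts.mkString(", "))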
Next, I repeatedly extracted the data into local collections:
val gss = List.fill(20)(guids.collect().toSet) // 20 full snapshots of the table
val gsall = gss.reduce(_ | _)                  // union of everything ever seen
val lost = gss.map(gs => (gsall &~ gs).size)   // rows missing from each snapshot
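Continuing from the values above, one can also pull out a sample of the guids a given snapshot missed, to probe them directly from cqlsh:

// Take a few guids present in the union but absent from the first snapshot.
val missingFromFirst = (gsall &~ gss.head).take(10)
missingFromFirst.foreach(println)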
The resulting lost is

List(5970, 7067, 6926, 6683, 5807, 7901, 7005, 6420, 6911, 6876, 7038, 7914, 6562, 6576, 6937, 7116, 7374, 6836, 7272, 7312)

so we lose 6.17 ± 0.47% of the data on each query.
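(That figure is just the mean and standard deviation of lost, computed roughly like this, taking the total as 112K:)

val total = 112000.0                            // approx. record count
val mean  = lost.sum / lost.size.toDouble
val std   = math.sqrt(lost.map(x => math.pow(x - mean, 2)).sum / lost.size)
println(f"${mean / total * 100}%.2f%% ± ${std / total * 100}%.2f%%")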
Could this be a problem with Cassandra, Spark, or the connector? And in each case, is there some configuration setting to prevent it?
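For reference, the one connector setting I have found that looks relevant is the input consistency level; a sketch of what I would try (whether this is actually the right fix is exactly what I am asking):

import org.apache.spark.SparkConf

// Assumption: raising the connector's read consistency to QUORUM would
// force each read to be acknowledged by a majority of replicas.
val conf = new SparkConf()
  .set("spark.cassandra.input.consistency.level", "QUORUM")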