1 vote

I need to find out whether the count of records in a Cassandra table is greater than a certain number, e.g. 10,000.

I don't have a large data set yet, but at large scale, with possibly billions of records, how could I do this efficiently?

There could potentially be billions of records, or just thousands. I only need to know whether there are more or fewer than 10K.

The query below doesn't seem right; I think it would fail or be very slow for a large number of records.

SELECT COUNT(*) FROM data WHERE sourceId = {id} AND timestamp < {endDate} AND timestamp > {startDate};

I could also do something like this:

SELECT * FROM data WHERE sourceId = {id} AND timestamp < {endDate} AND timestamp > {startDate} LIMIT 10000;

and count the rows in memory.
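The "count in memory" idea can be done without materializing all 10,000 rows at once: iterate the result set and stop as soon as the threshold is reached. A minimal sketch of that early-exit check, written against a plain `Iterable` (in practice the driver's paged `ResultSet` would be the input; the class and method names here are hypothetical):

```java
import java.util.Iterator;
import java.util.List;

public class ThresholdCounter {
    // Returns true if the iterable yields at least `threshold` elements.
    // Stops iterating as soon as the threshold is reached, so we never
    // count past what the question actually needs ("more or fewer than 10K").
    public static boolean hasAtLeast(Iterable<?> rows, int threshold) {
        int seen = 0;
        Iterator<?> it = rows.iterator();
        while (it.hasNext() && seen < threshold) {
            it.next();
            seen++;
        }
        return seen >= threshold;
    }

    public static void main(String[] args) {
        List<Integer> few = List.of(1, 2, 3);
        System.out.println(hasAtLeast(few, 2)); // true
        System.out.println(hasAtLeast(few, 5)); // false
    }
}
```

Combined with the `LIMIT 10000` query, this bounds both the data transferred and the work done on the client.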

I can't use a separate table for counting, e.g. incrementing a counter whenever a new record is written; that option is unacceptable.

Is there some other way to do this? Select with a limit looks clumsy, but it seems the most viable.

sourceId is the partition key and timestamp is the clustering key.

The Cassandra version is 3.11.4, and I work in Spring, if that is relevant.

what is the table structure? – Alex Ott

2 Answers

2 votes

You may introduce a bucket_id into the partition key, so the primary key becomes ((sourceId, bucket_id), timestamp). Bucketing is used in Cassandra to constrain the number of rows belonging to a single partition, i.e. a large partition is split into smaller chunks. To count all rows, issue an async query for each partition (sourceId, bucket_id), with the additional timestamp restriction. bucket_id can be derived from the timestamp, which makes it possible to determine which bucket_ids need to be queried for a given time range.
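As a sketch of "bucket_id derived from the timestamp": one common choice (an assumption here, not something the answer specifies) is day-sized buckets, where bucket_id is the day number since the epoch. Deriving the list of bucket_ids covering the query range then tells you exactly which (sourceId, bucket_id) partitions to query asynchronously:

```java
import java.util.ArrayList;
import java.util.List;

public class DayBuckets {
    private static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    // Derive bucket_id from an epoch-millis timestamp: the day number
    // since the epoch. Rows written on the same day share a partition.
    public static long bucketId(long epochMillis) {
        return epochMillis / MILLIS_PER_DAY;
    }

    // All bucket_ids covering [startMillis, endMillis]; each one names a
    // partition (sourceId, bucket_id) to hit with its own async COUNT query.
    public static List<Long> bucketsInRange(long startMillis, long endMillis) {
        List<Long> buckets = new ArrayList<>();
        for (long b = bucketId(startMillis); b <= bucketId(endMillis); b++) {
            buckets.add(b);
        }
        return buckets;
    }
}
```

The bucket size is a trade-off: smaller buckets keep partitions small but mean more queries to fan out per time range.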

Other solutions:

  • use Cassandra's counters (but I've read they affect performance, and they cannot correctly handle retried and speculative queries)
  • use another database, like Redis, which has atomic counters (but how do you keep Redis and Cassandra in sync?)
  • precalculate the values and save them during writes (for example, into static columns)
  • something else
1 vote

The first query:

SELECT COUNT(*) FROM data WHERE sourceId = {id} 
   AND timestamp < {endDate} AND timestamp > {startDate};

should work if you have a table with the following primary key: (sourceId, timestamp, ...). In this case, the aggregation is executed inside a single partition, so it won't involve hitting multiple nodes, etc. It may still time out if you have very slow disks or too much data in the given time range.

If you have a different table structure, then you'll need to use something like Spark, which will read the data from Cassandra, perform the filtering, and do the counting.