1 vote

I need to find out whether the count of records in a Cassandra table is greater than a certain number, e.g. 10,000.

I don't have a large data set yet, but at large scale, with possibly billions of records, how could I do this efficiently?

There could potentially be billions of records, or just thousands. I only need to know whether there are more or fewer than 10K.

The query below doesn't seem right; I think it would fail or be very slow for a large number of records.

SELECT COUNT(*) FROM data WHERE sourceId = {id} AND timestamp < {endDate} AND timestamp > {startDate};

I could also do something like this:

SELECT * FROM data WHERE sourceId = {id} AND timestamp < {endDate} AND timestamp > {startDate} LIMIT 10000;

and count the rows in memory.
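The "count in memory" idea can be done without materializing all 10,000 rows at once: iterate the result set and stop as soon as the threshold is reached. A minimal sketch of that early-exit check, written against a plain `Iterable` (in practice the driver's paged `ResultSet` would be the input; the class and method names here are hypothetical):

```java
import java.util.Iterator;
import java.util.List;

public class ThresholdCounter {
    // Returns true if the iterable yields at least `threshold` elements.
    // Stops iterating as soon as the threshold is reached, so we never
    // count past what the question actually needs ("more or fewer than 10K").
    public static boolean hasAtLeast(Iterable<?> rows, int threshold) {
        int seen = 0;
        Iterator<?> it = rows.iterator();
        while (it.hasNext() && seen < threshold) {
            it.next();
            seen++;
        }
        return seen >= threshold;
    }

    public static void main(String[] args) {
        List<Integer> few = List.of(1, 2, 3);
        System.out.println(hasAtLeast(few, 2)); // true
        System.out.println(hasAtLeast(few, 5)); // false
    }
}
```

Combined with the `LIMIT 10000` query, this bounds both the data transferred and the work done on the client.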

I can't use a separate table for counting, e.g. incrementing a counter whenever a new record is written; that option is unacceptable.

Is there some other way to do this? Select with a limit looks clumsy, but it seems the most viable.

sourceId is the partition key and timestamp is the clustering key.

The Cassandra version is 3.11.4, and I work in Spring, if that is relevant.

what is the table structure? – Alex Ott

2 Answers

2 votes

You may introduce a bucket_id into the partition key, so the primary key becomes ((sourceId, bucket_id), timestamp). Bucketing is used in Cassandra to constrain the number of rows belonging to a single partition, i.e. a large partition is split into smaller chunks. To count all rows, issue an async query for each partition (sourceId, bucket_id), with the additional timestamp restriction. bucket_id can be derived from the timestamp, which makes it possible to determine which bucket_ids need to be queried for a given time range.
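As a sketch of "bucket_id derived from the timestamp": one common choice (an assumption here, not something the answer specifies) is day-sized buckets, where bucket_id is the day number since the epoch. Deriving the list of bucket_ids covering the query range then tells you exactly which (sourceId, bucket_id) partitions to query asynchronously:

```java
import java.util.ArrayList;
import java.util.List;

public class DayBuckets {
    private static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    // Derive bucket_id from an epoch-millis timestamp: the day number
    // since the epoch. Rows written on the same day share a partition.
    public static long bucketId(long epochMillis) {
        return epochMillis / MILLIS_PER_DAY;
    }

    // All bucket_ids covering [startMillis, endMillis]; each one names a
    // partition (sourceId, bucket_id) to hit with its own async COUNT query.
    public static List<Long> bucketsInRange(long startMillis, long endMillis) {
        List<Long> buckets = new ArrayList<>();
        for (long b = bucketId(startMillis); b <= bucketId(endMillis); b++) {
            buckets.add(b);
        }
        return buckets;
    }
}
```

The bucket size is a trade-off: smaller buckets keep partitions small but mean more queries to fan out per time range.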

Other solutions:

  • use Cassandra's counters (but I've read they affect performance, and they cannot correctly handle retried and speculative queries)
  • use another database, like Redis, which has atomic counters (but how do you keep Redis and Cassandra in sync?)
  • precalculate the values and save them during writes (for example, into static columns)
  • something else
1 vote

The first query:

SELECT COUNT(*) FROM data WHERE sourceId = {id} 
   AND timestamp < {endDate} AND timestamp > {startDate};

should work if you have a table with the following primary key: (sourceId, timestamp, ...). In this case, the aggregation is executed inside a single partition, so it won't involve hitting multiple nodes, etc. It may still time out if you have very slow disks or too much data in the given time range.

If you have a different table structure, then you'll need to use something like Spark, which will read the data from Cassandra, perform the filtering, and do the counting.