1
votes

I need to get a count from a very large dataset in Cassandra, 100 million plus. I am worried about the memory hit cassandra would take if I just ran the following query.

select count(*) from conv_org where org_id = 'TEST_ORG'

I was told I could use cassandra Automatic Paging to do this? Does this seem like a good option?

Would the syntax look something like this?

Statement stmt = new SimpleStatement("select count(*) from conv_org where org_id = 'TEST_ORG'");
stmt.setFetchSize(1000);
ResultSet rs = session.execute(stmt);

I am unsure the above code will work as I do not need a result set back I just need a count.

Here is the data model.

CREATE TABLE ts.conv_org (
   org_id text,
   create_time timestamp,
   test_id text,
   org_type int,
   PRIMARY KEY (org_id, create_time, conv_id)
)
2

2 Answers

2
votes

If org_id isn't your primary key counting in cassandra in general is not a fast operation and can easily lead to a full scan of all sstables in your cluster and therefore be painfully slow.

In Java for example you can do something like this:

   ResultSet rs = session.execute(...);
   Iterator<Row> iter = rs.iterator();
   while (iter.hasNext()) {
       if (rs.getAvailableWithoutFetching() == 100 && !rs.isFullyFetched())
           rs.fetchMoreResults();
       Row row = iter.next()
       ... process the row ...
   }

https://docs.datastax.com/en/drivers/java/2.0/com/datastax/driver/core/ResultSet.html

You could select a small colum and count your self. There is int getAvailableWithoutFetching() and isFullyFetched() that could help you.

In general if you really need a count - maintain it yourself.

On the other hand, if you have really many rows in one partition you can have also some other performance problems.

But that's hard to say without knowing the data model.

0
votes

Maybe you want to use "counter table" in addition to your dataset.

Pros: get counter fast.

Cons: need to maintained that table.

Reference: https://docs.datastax.com/en/cql/3.3/cql/cql_using/useCountersConcept.html