
I have an estimated ~500 million rows of data with 5 million unique numbers. My queries must fetch data by number and event_date. With number as the partition key, there will be 5 million partitions. I think having that many small partitions is bad, and timeouts occur during queries, so I'm having trouble choosing a partition key. I found some synthetic sharding strategies, but couldn't apply them to my model. I could define the partition key as number mod N, but then rows wouldn't be distributed evenly among partitions.

How can I model this to reduce the partition count, and is reducing it even necessary? Is there a limit on the number of partitions?

CREATE TABLE events_by_number_and_date (
  number bigint,
  event_date int, /* eg. 20200520 */
  event text,
  col1 int,
  col2 decimal,
  PRIMARY KEY (number, event_date)
);
What queries are executed? Most probably the timeouts are not from the number of partitions. – Alex Ott
Only `select count(*) from events_by_number_and_date;` throws "Cassandra timeout during read query at consistency LOCAL_ONE (1 responses were required but only 0 replica responded)". – nikli

1 Answer


For your query, changing the data model won't help, because the query itself is unsuitable for Cassandra. Although Cassandra supports aggregations such as max, count, avg, and sum, they are designed to work inside a single partition, not across the whole cluster. If you issue them without a restriction on the partition key, the coordinator node needs to reach every node in the cluster, and they have to go through all the data in the cluster.
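To illustrate the difference, a partition-restricted aggregate is served by one replica set, while the unrestricted one fans out to every node (the number value is illustrative):

```sql
-- Fine: count within a single partition
SELECT count(*) FROM events_by_number_and_date WHERE number = 12345;

-- Problematic: unrestricted count must touch every partition on every node
SELECT count(*) FROM events_by_number_and_date;
```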

You can still do this kind of query, but it's better to use something like Spark, which is heavily optimized for parallel data processing, and the Spark Cassandra Connector is able to perform this querying correctly. If you can't use Spark, you can implement your own full token range scan, using code similar to this. But in any case, don't expect a "real-time" answer (< 1 sec).
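The core of such a token range scan can be sketched as below. This is a minimal sketch assuming the default Murmur3Partitioner (whose token range spans -2^63 to 2^63 - 1); the function name and split count are illustrative:

```python
# Split the full Murmur3 token range into contiguous sub-ranges so that
# each sub-range can be counted with its own restricted query.

MIN_TOKEN = -(2**63)
MAX_TOKEN = 2**63 - 1

def split_token_range(n_splits):
    """Return n_splits contiguous (start, end] sub-ranges that together
    cover the entire Murmur3 token range."""
    step = (MAX_TOKEN - MIN_TOKEN) // n_splits
    ranges = []
    start = MIN_TOKEN
    for i in range(n_splits):
        # Force the last range to end exactly at MAX_TOKEN to avoid
        # losing tokens to integer-division rounding.
        end = MAX_TOKEN if i == n_splits - 1 else start + step
        ranges.append((start, end))
        start = end
    return ranges

# Each sub-range then becomes one server-side restricted query, e.g.:
#   SELECT count(*) FROM events_by_number_and_date
#   WHERE token(number) > ? AND token(number) <= ?
# Summing the per-range counts gives the cluster-wide total; running the
# ranges in parallel keeps each individual query small and fast.
```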