I estimate ~500 million rows of data with 5 million unique numbers. My query must fetch data by number and event_date. With number as the partition key there will be 5 million partitions, and I think it is not good to have that many small partitions; timeouts occur during queries. I'm having trouble defining the partition key. I have found some synthetic sharding strategies, but I couldn't apply them to my model. I could define the partition key as number modulo some constant (sketched after my current schema below), but then rows aren't distributed evenly among partitions.

How can I model this to reduce the partition count, and is reducing it even necessary? Is there any limit on the number of partitions?
CREATE TABLE events_by_number_and_date (
    number bigint,
    event_date int,  /* e.g. 20200520 */
    event text,
    col1 int,
    col2 decimal,
    PRIMARY KEY (number, event_date)
);
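For reference, here is a minimal sketch of what I mean by the mod approach; the bucket column, the modulus of 1000, and the table name are placeholders of my own choosing, and the bucket value would be computed client-side:

CREATE TABLE events_by_bucket_and_date (
    bucket int,        /* computed client-side as number % 1000; the modulus is arbitrary */
    number bigint,
    event_date int,    /* e.g. 20200520 */
    event text,
    col1 int,
    col2 decimal,
    PRIMARY KEY (bucket, number, event_date)
);

/* a point read still hits a single partition, e.g.: */
SELECT * FROM events_by_bucket_and_date
WHERE bucket = 42 AND number = 1042 AND event_date = 20200520;

Because some numbers have far more rows than others, the buckets end up with uneven sizes, which is the imbalance I mentioned above.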
select count(*) from events_by_number_and_date;

throws: Cassandra timeout during read query at consistency LOCAL_ONE (1 responses were required but only 0 replica responded)
– nikli