I'm inserting into Cassandra 3.12 via the Python (DataStax) driver using CQL BatchStatements [1]. With a primary key that yields a small number of partitions (10-20), everything works well, but the data is not distributed uniformly across nodes.
If I add a high-cardinality column to the partition key, for example time or client IP in addition to date, the same batch inserts fail with a Partition Too Large error, even though the number of rows and the row length are the same.
Higher-cardinality keys should produce more, but smaller, partitions. How can a key that generates more partitions cause this error?
[1] Although everything I have read suggests that batch inserts can be an anti-pattern, for batches confined to a single partition I still see the highest throughput for this case, compared to async or concurrent inserts.
CREATE TABLE test (
    date    date,
    time    time,
    cid     text,
    loc     text,
    src     text,
    dst     text,
    size    bigint,
    s_bytes bigint,
    d_bytes bigint,
    time_ms bigint,
    log     text,
    PRIMARY KEY ((date, loc, cid), src, time, log)
)
WITH compression = {'class': 'LZ4Compressor'}
AND compaction = {
    'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': '1'
};
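For reference, the single-partition batching from [1] looks roughly like this. This is a minimal sketch, not my exact code: the helper name `group_rows_by_partition`, the sample keyspace, and the row-dict shape are illustrative; the grouping step is plain Python, and the driver calls are shown commented since they need a live cluster.

```python
# Sketch: batch rows so each BatchStatement touches exactly one partition.
# Assumes the "test" table above; helper/keyspace names are hypothetical.
from collections import defaultdict

def group_rows_by_partition(rows):
    """Group row dicts by the (date, loc, cid) partition key so that
    each group can become one single-partition batch."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["date"], row["loc"], row["cid"])].append(row)
    return dict(groups)

# With a live cluster, each group becomes one unlogged batch:
# from cassandra.cluster import Cluster
# from cassandra.query import BatchStatement, BatchType
#
# session = Cluster(["127.0.0.1"]).connect("my_ks")  # keyspace is an example
# insert = session.prepare(
#     "INSERT INTO test (date, loc, cid, src, time, log) VALUES (?, ?, ?, ?, ?, ?)")
# for key, group in group_rows_by_partition(rows).items():
#     batch = BatchStatement(batch_type=BatchType.UNLOGGED)
#     for r in group:
#         batch.add(insert, (r["date"], r["loc"], r["cid"],
#                            r["src"], r["time"], r["log"]))
#     session.execute(batch)
```

With the low-cardinality key, one batch maps to one of the 10-20 partitions; once the partition key includes a high-cardinality column, the same grouping produces many more, smaller groups.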