1 vote

I'm inserting into Cassandra 3.12 via the Python (DataStax) driver and CQL BatchStatements [1]. With a primary key that results in a small number of partitions (10-20), all works well, but the data is not uniformly distributed across nodes.

If I include a high-cardinality column, for example time or client IP, in addition to date, the batch inserts result in a Partition Too Large error, even though the number of rows and the row length are the same.

Higher-cardinality keys should result in more, but smaller, partitions. How does a key that generates more partitions result in this error?


[1] Although everything I have read suggests that batch inserts can be an anti-pattern, with each batch covering only one partition I still see the highest throughput for this case compared to async or concurrent inserts.


CREATE TABLE test (
    date date,
    time time,
    cid text,
    loc text,
    src text,
    dst text,
    size bigint,
    s_bytes bigint,
    d_bytes bigint,
    time_ms bigint,
    log text,
    PRIMARY KEY ((date, loc, cid), src, time, log)
) WITH compression = { 'class' : 'LZ4Compressor' }
  AND compaction = { 'class' : 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy',
                     'compaction_window_unit' : 'DAYS',
                     'compaction_window_size' : '1' };
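
For reference, the insert loop looks roughly like this (simplified; the contact point, keyspace name, and row source are illustrative, not my exact code):

from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

session = Cluster(['127.0.0.1']).connect('my_ks')   # illustrative contact point / keyspace
insert = session.prepare(
    "INSERT INTO test (date, loc, cid, src, time, log, size) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)")

# every row in one batch shares the same (date, loc, cid) partition key
batch = BatchStatement()
for row in rows_for_one_partition:                  # illustrative iterable of value tuples
    batch.add(insert, row)
session.execute(batch)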

What is the size of the column that you moved into the partition key? – Alex Ott
@AlexOtt In the table above, I replaced the date column (type date) with a timestamp type column. If I include both date and time columns I get the same issue. I mention cardinality because the addition of time should result in many more partitions (because of its resolution), but it does not change the row size. I don't believe I need that many partitions, but I'm trying to understand the relationship between the primary key and the error. – kermatt

1 Answer

3 votes

I guess you meant Caused by: com.datastax.driver.core.exceptions.InvalidQueryException: Batch too large errors?

This is because of the parameter batch_size_fail_threshold_in_kb, which by default limits a single batch to 50 kB of data - and there is also an earlier warning at a 5 kB threshold through batch_size_warn_threshold_in_kb in cassandra.yaml (see http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/configuration/configCassandra_yaml.html).
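
As a rough back-of-the-envelope check (the per-row size below is an assumption - measure the real size of one bound row in your workload):

# Rough sizing check against the default 50 kB batch_size_fail_threshold_in_kb
FAIL_THRESHOLD_KB = 50        # cassandra.yaml default for batch_size_fail_threshold_in_kb
APPROX_ROW_KB = 0.5           # assumption: measure your own bound-row size
rows_per_batch = int(FAIL_THRESHOLD_KB / APPROX_ROW_KB * 0.8)   # ~80 rows, keeping ~20% headroom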

Can you share your data model? Just adding a column doesn't mean the partition key changes - maybe you changed the primary key only by adding a clustering column. Hint: PRIMARY KEY (a,b,c,d) uses only a as the partition key, while PRIMARY KEY ((a,b),c,d) uses a,b as the partition key - an easily overlooked mistake.
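
One quick way to confirm which columns Cassandra actually treats as the partition key is the driver's table metadata (the contact point and keyspace name 'my_ks' are placeholders):

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])                 # placeholder contact point
cluster.connect()                                # populates cluster.metadata
meta = cluster.metadata.keyspaces['my_ks'].tables['test']
print([c.name for c in meta.partition_key])      # partition key columns, e.g. ['date', 'loc', 'cid']
print([c.name for c in meta.clustering_key])     # clustering columns, e.g. ['src', 'time', 'log']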

Apart from that, the additional column takes some space - so you can easily hit the threshold now; just reduce the batch size so it fits within the limit again. In general it's a good approach to batch only upserts that affect a single partition, as you mentioned. Also make use of async queries and issue parallel requests to different coordinators to gain some more speed, along the lines of the sketch below.
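
A sketch of that pattern with the Python driver - group rows by partition key, build one unlogged batch per partition, and run the batches concurrently (the contact point, keyspace, and the rows iterable are assumptions):

from collections import defaultdict
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement, BatchType
from cassandra.concurrent import execute_concurrent

session = Cluster(['127.0.0.1']).connect('my_ks')          # assumed contact point / keyspace
insert = session.prepare(
    "INSERT INTO test (date, loc, cid, src, time, log, size) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)")

by_partition = defaultdict(list)
for r in rows:                                             # rows: assumed iterable of dicts
    by_partition[(r['date'], r['loc'], r['cid'])].append(r)

statements = []
for part_rows in by_partition.values():
    batch = BatchStatement(batch_type=BatchType.UNLOGGED)  # single partition, so UNLOGGED is fine
    for r in part_rows:
        batch.add(insert, (r['date'], r['loc'], r['cid'],
                           r['src'], r['time'], r['log'], r['size']))
    statements.append((batch, None))

# keeps up to `concurrency` requests in flight; the driver spreads them across coordinators
execute_concurrent(session, statements, concurrency=32, raise_on_first_error=True)

If a single partition's rows would still exceed the 50 kB threshold, split that partition's rows into several smaller batches as well.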