I'm evaluating the insertion process on Apache Cassandra 2.0.14. I'm using a benchmark tool called YCSB, which sends 1 record per second to a Cassandra cluster with a single node.

After each record I check the Memtable data size with nodetool (the cfstats command), and I see that it grows proportionally up to the 29th record. On the 30th record, however, the Memtable data size is no longer proportional to the earlier values. See the results below:

N of Records: (1, 10, 25, 30)

Memtable Data size (bytes): (11810, 118100, 295250, 217614)

Proportionality in relation to 1st: (-, 10, 25, 18.43*)

*: should be 30
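
The ratios above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic on the figures listed; the variable names are illustrative.

```python
# Record counts and the Memtable data sizes (bytes) reported by cfstats.
records = [1, 10, 25, 30]
sizes = [11810, 118100, 295250, 217614]

# Ratio of each measurement to the size observed after the first record.
ratios = [round(size / sizes[0], 2) for size in sizes]

for n, size, ratio in zip(records, sizes, ratios):
    # If growth were proportional, the ratio would equal the record count.
    marker = "" if ratio == n else "  <- not proportional"
    print(f"{n:>3} records: {size:>7} bytes, ratio {ratio}{marker}")
```

Only the last measurement breaks the pattern: its ratio falls well short of the record count.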

Why is this happening?

No flush occurs before the 30th record.

Some properties in cassandra.yaml:

memtable_total_space_in_mb: 10

memtable_flush_writers: 1

memtable_flush_queue_size: 4

1 Answer

To start with, 2.0.14 is very old, and these settings (which I assume are just for this test) are far from optimal. I highly recommend using at least 2.1, but you should consider 3.11 for a number of reasons, including the accuracy of this metric. From 2.1 onward this calculation is different.

Make sure the jamm agent is running; without it the memtable size metric will be very inaccurate. jamm is used to calculate the deep size of the memtable.

Every time a mutation is applied, Cassandra decides whether it should recalculate the live ratio. For each table, this happens once the operation count has grown 10x since the last calculation. The recalculation is submitted asynchronously to the MemoryMeter thread pool and does not block the insertion of the mutation. When it runs, it finds the actual "deep size" of the memtable, including JVM overhead, and compares it to the running assumed size of the memtable to obtain the liveRatio.
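
The trigger described above can be sketched roughly as follows. This is not Cassandra's code: `TableSketch`, `measure_deep_size`, and the starting threshold of 1 are hypothetical names and values chosen for illustration, and a single-worker pool stands in for the MemoryMeter thread pool.

```python
from concurrent.futures import ThreadPoolExecutor


class TableSketch:
    """Hypothetical per-table state; only illustrates the trigger logic."""

    GROWTH_FACTOR = 10  # from the text: recompute after 10x the operations

    def __init__(self, first_threshold=1):
        self.operations = 0
        self.computed_at = first_threshold  # assumed starting point
        self.meter_pool = ThreadPoolExecutor(max_workers=1)
        self.pending = []       # futures for in-flight measurements
        self.triggered_at = []  # operation counts that kicked off a measurement

    def measure_deep_size(self, ops):
        # Placeholder for the expensive deep-size walk of the memtable.
        return ops

    def apply(self):
        """Apply one mutation; never waits for a measurement to finish."""
        self.operations += 1
        if self.operations >= self.computed_at * self.GROWTH_FACTOR:
            self.computed_at = self.operations
            self.triggered_at.append(self.operations)
            # Submitted asynchronously: the write path returns immediately.
            future = self.meter_pool.submit(self.measure_deep_size, self.operations)
            self.pending.append(future)


table = TableSketch()
for _ in range(1000):
    table.apply()
table.meter_pool.shutdown(wait=True)
print(table.triggered_at)
```

Because the measurement completes on another thread, the moment at which a new ratio takes effect relative to incoming writes is not fixed.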

To estimate the current live memtable space, the last computed live ratio is multiplied by the current size of the memtable. This is a very rough estimate and has a few bounds, since some kinds of data (e.g. tombstones) have very different footprints than others.
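
As a rough illustration of that estimate, here is a minimal sketch. Every constant in it is an assumption: the bounds, the starting ratio of 10.0, and the raw size of 1181 bytes per record are hypothetical values picked so that the arithmetic lines up with the figures in the question, not values read from your node.

```python
# Hypothetical sketch of a live-ratio based size estimate; not Cassandra source.

MIN_RATIO = 1.0   # assumed lower bound on the ratio
MAX_RATIO = 64.0  # assumed upper bound on the ratio


def clamp_ratio(deep_size, raw_size):
    """Measured deep size divided by tracked raw size, kept within the bounds."""
    ratio = deep_size / raw_size
    return max(MIN_RATIO, min(MAX_RATIO, ratio))


def estimated_live_size(raw_size, live_ratio):
    """Estimate = last computed live ratio * current raw memtable size."""
    return round(raw_size * live_ratio)


RAW_PER_RECORD = 1181  # assumed raw bytes per record
ratio = 10.0           # assumed ratio in use before any measurement completes

# While the ratio is unchanged, the estimate grows linearly with the data.
sizes_before = [estimated_live_size(n * RAW_PER_RECORD, ratio) for n in (1, 10, 25)]

# Suppose a deep-size measurement of 217614 bytes completes around record 30:
# the ratio drops, and so does the reported size, without any flush.
ratio = clamp_ratio(217614, 30 * RAW_PER_RECORD)
size_after = estimated_live_size(30 * RAW_PER_RECORD, ratio)

print(sizes_before, size_after, round(ratio, 2))
```

Under these assumptions the jump at the 30th record is just the ratio being replaced by a measured value, which would explain why the size stops being proportional even though nothing was flushed.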

In 2.1 and 3.0 you can expect this metric to be more consistent with expectations (though maybe still not perfect), but in 2.0 the memtable data size is a rough heuristic for deciding when to flush and shouldn't be expected to be (easily) deterministic, if for no other reason than the asynchronous nature of the liveRatio updates.