I have worked with cassandra for quite some time (DSE) and am trying to understand something that isn't quite clear. We're running DSE 5.1.9 for this illustration. It's a single node cluster (If you have a multi-node cluster, ensure RF=nodeCount to make things easier).
It's very simple example: Create the following simple table:
CREATE TABLE mytable (
status text,
process_on_date_time int,
PRIMARY KEY (status, process_on_date_time)
) WITH CLUSTERING ORDER BY (process_on_date_time ASC)
AND gc_grace_seconds = 60
I have a piece of code that inserts 5k records at a time up to 200k total records with TTL of 300 seconds. The status is ALWAYS "pending" and the process_on_date_time is a counter that increments by 1, starting at 1 (all unique records - 1 - 200k basically).
I run the code and then once it completes, I flush the memtable to disk. There's only a single sstable created. After this, no compaction, no repair, nothing else runs that would create or change the sstable configuration.
After the sstable dump, I go into cqlsh, turn on tracing, set consistency to LOCAL_ONE and paging off. I then run this repetitively:
SELECT * from mytable where status = 'pending' and process_on_date_time <= 300000;
What is interesting is I see things like this (cutting out some text for readability):
Run X) Read 31433 live rows and 85384 tombstone cells (31k rows returned to my screen)
Run X+1) Read 0 live rows and 76376 tombstone cells (0 rows returned to my screen - all rows expired at this point)
Run X+2) Read 0 live rows and 60429 tombstone cells
Run X+3) Read 0 live rows and 55894 tombstone cells
...
Run X+X) Read 0 live rows and 0 tombstone cells
What is going on? The sstable isn't changing (obviously as it's immutable), nothing else inserted, flushed, etc. Why is the tombstone count decreasing until it's at 0? What causes this behavior?
I would expect to see every run: 100k tombstones read and the query aborting as all TTL have expired in the single sstable.