Cassandra tombstones with TTL

Question

I have worked with Cassandra (DSE) for quite some time and am trying to understand something that isn't quite clear. We're running DSE 5.1.9 for this illustration. It's a single-node cluster (if you have a multi-node cluster, set RF equal to the node count to make things easier).

It's a very simple example. Create the following table:

CREATE TABLE mytable (
    status text,
    process_on_date_time int,
    PRIMARY KEY (status, process_on_date_time)
) WITH CLUSTERING ORDER BY (process_on_date_time ASC)
AND gc_grace_seconds = 60;

I have a piece of code that inserts 5k records at a time, up to 200k total records, with a TTL of 300 seconds. The status is ALWAYS "pending" and process_on_date_time is a counter that increments by 1, starting at 1 (all records are unique: 1 to 200k, basically).
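For reference, the load described above can be sketched roughly as follows. This is only an illustration in Python, not the actual loader: the function name insert_batches is hypothetical, and it only builds the CQL strings rather than sending them through a driver.

```python
from typing import Iterator, List


def insert_batches(total: int = 200_000, batch_size: int = 5_000,
                   ttl_seconds: int = 300) -> Iterator[List[str]]:
    """Yield lists of CQL INSERT statements, batch_size rows per list.

    Hypothetical helper: it mirrors the description (status is always
    'pending', the counter runs from 1 to total, every row has a TTL).
    """
    for start in range(1, total + 1, batch_size):
        stop = min(start + batch_size, total + 1)
        yield [
            "INSERT INTO mytable (status, process_on_date_time) "
            f"VALUES ('pending', {n}) USING TTL {ttl_seconds};"
            for n in range(start, stop)
        ]


if __name__ == "__main__":
    batches = list(insert_batches())
    print(len(batches), "batches")
    print(batches[0][0])
```

Because the rows are written over a span of time rather than all at once, their TTLs also expire over a span of time, which matters for the counts below.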

I run the code and then once it completes, I flush the memtable to disk. There's only a single sstable created. After this, no compaction, no repair, nothing else runs that would create or change the sstable configuration.

After the flush, I go into cqlsh, turn on tracing, set consistency to LOCAL_ONE, and turn paging off. I then run this repeatedly:

SELECT * FROM mytable WHERE status = 'pending' AND process_on_date_time <= 300000;

What is interesting is I see things like this (cutting out some text for readability):

Run X) Read 31433 live rows and 85384 tombstone cells (31k rows returned to my screen) 
Run X+1) Read 0 live rows and 76376 tombstone cells (0 rows returned to my screen - all rows expired at this point) 
Run X+2) Read 0 live rows and 60429 tombstone cells 
Run X+3) Read 0 live rows and 55894 tombstone cells 
... 
Run X+X) Read 0 live rows and 0 tombstone cells

What is going on? The sstable isn't changing (it's immutable, obviously), and nothing else is being inserted, flushed, etc. Why is the tombstone count decreasing until it's at 0? What causes this behavior?

I would expect every run to read 100k tombstones and the query to abort, since all TTLs have expired in the single sstable.

Did you try sstabledump to check the actual contents of the SSTable before the first SELECT call? With that you can see how many tombstones it really has. — M P

Jim Wartnick · Accepted Answer · 2019-03-01T16:01:43

For anyone else who may be curious about the answer, I opened a ticket with DataStax, and here is what they said:

After the tombstones pass the gc_grace_seconds they will be ignored in result sets because they are filtered out after they have passed that point. So you are correct in the assumption that the only way for the tombstone warning to post would be for the data to be past their TTL but still within gc_grace.

And since they are ignored/filtered out they won't have any harmful effect on the system since, like you said, they are skipped.

So what this means is that if TTLs have expired but the records are still within gc_grace_seconds, they are counted as tombstones when queried. If the TTLs have expired AND gc_grace_seconds has also elapsed, they are NOT counted as tombstones (they are skipped). The system still has to "weed" through the expired TTL records, but other than the processing time, they are not "harmful" to the query. I found this very interesting, as I don't see it documented anywhere.
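To make the rule concrete, here is a small Python model of it. It is a simplified sketch of the behavior described above, not Cassandra's actual read path: the names classify and read_counts are made up for illustration, the exact boundary handling is an assumption, and the write times (one row per second) are hypothetical.

```python
from collections import Counter


def classify(write_time: int, ttl: int, gc_grace: int, now: int) -> str:
    """Classify one TTL'd cell at read time (all values in seconds)."""
    expires_at = write_time + ttl
    if now < expires_at:
        return "live"        # TTL not reached yet: returned to the client
    if now <= expires_at + gc_grace:
        return "tombstone"   # expired but within gc_grace: counted as a tombstone
    return "skipped"         # expired and past gc_grace: filtered out, not counted


def read_counts(write_times, ttl: int, gc_grace: int, now: int) -> Counter:
    """Tally how a read at time `now` would see each cell under this model."""
    return Counter(classify(w, ttl, gc_grace, now) for w in write_times)


if __name__ == "__main__":
    write_times = range(100)  # assumption: one row written per second
    for now in (320, 380, 420, 470):
        counts = read_counts(write_times, ttl=300, gc_grace=60, now=now)
        print(now, counts["live"], counts["tombstone"], counts["skipped"])
```

With TTL 300 and gc_grace_seconds 60, each row in this model spends only a 60-second window being counted as a tombstone. Once the last row has expired, every later read sees fewer tombstones until the count reaches 0, even though the sstable itself never changes.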

Thought others may be interested in this information and could add to it if their experiences differ.
