Oversimplifying the data model, we have the following tables:
-- Main data table, keyed by id
CREATE TABLE storage (
    id timeuuid,
    foo blob,
    bar blob,
    baz blob,
    data blob,
    PRIMARY KEY ((id))
);

-- Manually maintained index tables: look up ids by foo, bar, or baz
CREATE TABLE storage_idx_by_foo (
    foo blob,
    id timeuuid,
    PRIMARY KEY ((foo), id)
);

CREATE TABLE storage_idx_by_bar (
    bar blob,
    id timeuuid,
    PRIMARY KEY ((bar), id)
);

CREATE TABLE storage_idx_by_baz (
    baz blob,
    id timeuuid,
    PRIMARY KEY ((baz), id)
);
The first table can contain hundreds of millions of records, and we use the index tables to look up ids by the queryable columns foo, bar, and baz.
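A typical lookup by foo is therefore a two-step read, roughly like this (a sketch; the ? placeholders stand for bound values, and the second query runs once per returned id):
SELECT id FROM storage_idx_by_foo WHERE foo = ?;
SELECT data FROM storage WHERE id = ?;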
The problem comes when we have to purge data based on foo, bar, or baz: we have to delete the entry from the storage table and from all the index tables. So, assuming we delete by e.g. foo, the steps are (sketched in CQL after the list):
1. Find the id in the appropriate index table (in this case storage_idx_by_foo)
2. Read bar and baz for that id and delete the record from the storage table
3. Delete the records from the remaining two index tables (we now have bar/baz and the id)
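In CQL the purge by foo looks roughly like this (a sketch; ? marks bound values, and the statements in steps 2 and 3 run once per matching id):
-- Step 1: find the ids for the given foo
SELECT id FROM storage_idx_by_foo WHERE foo = ?;
-- Step 2: read bar and baz, then delete the main record
SELECT bar, baz FROM storage WHERE id = ?;
DELETE FROM storage WHERE id = ?;
-- Step 3: delete the matching rows from the other two index tables
DELETE FROM storage_idx_by_bar WHERE bar = ? AND id = ?;
DELETE FROM storage_idx_by_baz WHERE baz = ? AND id = ?;
-- The foo index itself can be cleared as a single partition delete:
DELETE FROM storage_idx_by_foo WHERE foo = ?;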
Step 3 is the problem, because of tombstones: since we delete individual rows from the two remaining index tables rather than whole partitions, purging millions of records creates millions of row tombstones, which make reads on those tables very expensive until compaction removes them.
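To make the distinction concrete: deleting by the full primary key produces one row tombstone per record, while deleting by the partition key alone produces a single partition tombstone, no matter how many rows the partition holds:
-- One row tombstone per statement; millions of these accumulate:
DELETE FROM storage_idx_by_bar WHERE bar = ? AND id = ?;
-- A single partition tombstone for the whole partition:
DELETE FROM storage_idx_by_bar WHERE bar = ?;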
Some quick brainstorming suggests we can:
1. Force a compaction after the purging process (see the sketch after this list)
2. Skip deleting from those two index tables and handle entries that point to non-existent records in application code
3. ????
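For option 1, one thing to keep in mind is that compaction only drops tombstones older than the table's gc_grace_seconds, so forcing one right after the purge may not remove anything. A minimal sketch (the 3600 value is an illustrative assumption, and lowering gc_grace_seconds is only safe if repairs run more often than that window):
-- Tombstones become eligible for removal only after gc_grace_seconds:
ALTER TABLE storage_idx_by_bar WITH gc_grace_seconds = 3600;
ALTER TABLE storage_idx_by_baz WITH gc_grace_seconds = 3600;
-- Once the window has passed, a major compaction can be forced, e.g.:
--   nodetool compact <keyspace> storage_idx_by_bar storage_idx_by_baz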
What is the suggested approach? I am guessing other Cassandra users have encountered this issue as well, but I couldn't find any advice online other than "you are doing Cassandra wrong". I don't think we could have modelled our data differently to avoid this problem (or if we could, I would appreciate feedback on that as well).
Currently, we are leaning towards option number 2, although I do not like the idea of garbage being left in the database.
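For option 2, one way to bound the garbage would be a lazy repair on read (a sketch; the flow is driven by application code): when an index lookup returns an id that no longer exists in storage, delete that stale index entry on the spot.
SELECT id FROM storage_idx_by_bar WHERE bar = ?;
SELECT id FROM storage WHERE id = ?;
-- If the storage row is gone, remove the stale index entry:
DELETE FROM storage_idx_by_bar WHERE bar = ? AND id = ?;
This still creates tombstones, but spread out over time and only on partitions that are actually read, rather than millions at once.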