
I'm new to Cassandra and would like to ask what the correct data-model design pattern would be for this task. I want to model the data so that it can be removed later.

I have 100,000,000 records per day of this structure:

  • transaction_id <- this is unique
  • transaction_time
  • transaction_type
  • user_name
  • ... some other information

I will need to fetch data by user_name (I have about 5,000,000 users). I will also need to find transaction details by their ID. All the data becomes irrelevant after about 30 days, so I need a way to delete outdated rows.

As far as I have found, TTLs expire column values, not rows. So far I have come up with this model, which as I understand it implies really wide rows:

CREATE TABLE user_transactions (
  transaction_date timestamp,    //date part of transaction
  user_name text,
  transaction_id text,
  transaction_time timestamp,    //original transaction time
  transaction_type int,
  PRIMARY KEY ((transaction_date, user_name), transaction_id)
);

CREATE INDEX idx_user_transactions_uname ON USER_TRANSACTIONS(user_name);
CREATE INDEX idx_user_transactions_tid ON USER_TRANSACTIONS(transaction_id);

But this model does not allow deletions by transaction_date. It also builds indexes with high cardinality, which the Cassandra docs strongly discourage.

So what will be the correct model for this task?

EDIT:

The ugly workaround I have come up with so far is to create a single table per date partition. Mind you, I call this a workaround and not a solution. I'm still looking for the right data model:

CREATE TABLE user_transactions_YYYYMMDD (
  user_name text,
  transaction_id text,
  transaction_time timestamp, 
  transaction_type int,
  PRIMARY KEY (user_name, transaction_id)
);

YYYYMMDD is the date part of the transaction. We can create a similar table keyed by transaction_id for transaction lookup. Obsolete tables can be dropped or truncated.
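For comparison, a common alternative to one-table-per-date is to keep the date bucket inside a single table's partition key and let a table-level default TTL handle expiry. A rough sketch (the table name is just illustrative, and the default_time_to_live option requires Cassandra 2.0 or later):

```cql
-- One table for all days: partition by (date bucket, user),
-- cluster by transaction_id. Every row written to this table
-- expires automatically 30 days after it is written.
CREATE TABLE user_transactions_bucketed (
  transaction_date text,         -- date bucket, e.g. '20140131'
  user_name text,
  transaction_id text,
  transaction_time timestamp,    -- original transaction time
  transaction_type int,
  PRIMARY KEY ((transaction_date, user_name), transaction_id)
) WITH default_time_to_live = 2592000;  -- 30 days in seconds

-- Fetching a user's transactions requires supplying the date bucket:
SELECT * FROM user_transactions_bucketed
WHERE transaction_date = '20140131' AND user_name = 'some_user';
```

This keeps partitions bounded (one per user per day) and avoids both per-date tables and secondary indexes, at the cost of having to query one bucket per day of interest.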


2 Answers


Maybe you should denormalize your data model. For example, to query by user_name you can use a column family like this:

CREATE TABLE user_transactions (
  transaction_date timestamp,    //date part of transaction
  user_name text,
  transaction_id text,
  transaction_time timestamp,    //original transaction time
  transaction_type int,
  PRIMARY KEY (user_name, transaction_id)
);

So you can query using the partition key directly like this:

SELECT * FROM user_transactions WHERE user_name = 'USER_NAME';

And for lookups by ID you can use another column family (it needs a different name, since user_transactions is already taken):

CREATE TABLE transactions_by_id (
  transaction_date timestamp,    //date part of transaction
  user_name text,
  transaction_id text,
  transaction_time timestamp,    //original transaction time
  transaction_type int,
  PRIMARY KEY (transaction_id)
);

so the query could be something like this:

SELECT * FROM transactions_by_id WHERE transaction_id = 'ID';

This way you don't need any indexes.

About the TTL: maybe you could programmatically ensure that you update all the columns in a row at the same time (in the same CQL statement), so that they all expire together.
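For instance, a single INSERT with a TTL writes every column with the same timestamp and the same expiry (the values and the 30-day TTL below are just examples matching the question):

```cql
-- All columns are written in one statement, so they share one TTL
-- and will all expire together after 30 days (2592000 seconds).
INSERT INTO user_transactions
  (user_name, transaction_id, transaction_date, transaction_time, transaction_type)
VALUES
  ('some_user', 'tx-001', '2014-01-31', '2014-01-31 10:15:00', 1)
USING TTL 2592000;
```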


Perhaps my answer will be a little useful.

I would have done so:

CREATE TABLE user_transactions (
  date timestamp,
  user_name text,
  id text,
  type int,
  PRIMARY KEY (id)
);

CREATE INDEX idx_user_transactions_uname ON user_transactions (user_name);

There is no need for a 'transaction_time timestamp' column, because this time is set by Cassandra on each column and can be fetched with the WRITETIME(column_name) function. Because you write all the columns simultaneously, you can call this function on any column.

INSERT INTO user_transactions ... USING TTL 86400;

will expire all columns simultaneously, so you do not need to worry about deleting rows. See here: Expiring columns.
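To illustrate the WRITETIME point above: since all columns were written together, the write time (and the remaining TTL) can be read back from any one of them, for example:

```cql
-- WRITETIME returns the write timestamp in microseconds since epoch;
-- TTL returns the remaining seconds before the column expires.
SELECT id, WRITETIME(type), TTL(type)
FROM user_transactions
WHERE id = 'tx-001';
```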

But as far as I know, you cannot delete an entire row this way: the key column still remains, and NULL is written to the other columns. If you want to delete the rows manually, or just want an estimate of how many rows a TTL will remove, then I recommend the Astyanax driver: AllRowsReader All rows query.

And indeed, as a driver for working with Cassandra, I recommend Astyanax.