
I'm new to Cassandra and would like to ask what the correct data-model design pattern would be for this task. I want to model the data so that it can be removed later.

I have 100,000,000 records per day of this structure:

  • transaction_id <- this is unique
  • transaction_time
  • transaction_type
  • user_name
  • ... some other information

I will need to fetch data by user_name (I have about 5,000,000 users). I will also need to find transaction details by their ID. All the data becomes irrelevant after about 30 days, so I need a way to delete outdated rows.

As far as I have found, TTLs expire column values, not rows. So far I have come up with this model, which as I understand it implies really wide rows:

CREATE TABLE user_transactions (
  transaction_date timestamp,    //date part of transaction
  user_name text,
  transaction_id text,
  transaction_time timestamp,    //original transaction time
  transaction_type int,
  PRIMARY KEY ((transaction_date, user_name), transaction_id)
);

CREATE INDEX idx_user_transactions_uname ON USER_TRANSACTIONS(user_name);
CREATE INDEX idx_user_transactions_tid ON USER_TRANSACTIONS(transaction_id);

But this model does not allow deletions by transaction_date. It also builds indexes with high cardinality, which the Cassandra docs strongly discourage.

So what will be the correct model for this task?

EDIT:

The ugly workaround I have come up with so far is to create a single table per date partition. Mind you, I call this a workaround and not a solution. I'm still looking for the right data model:

CREATE TABLE user_transactions_YYYYMMDD (
  user_name text,
  transaction_id text,
  transaction_time timestamp, 
  transaction_type int,
  PRIMARY KEY (user_name, transaction_id)
);

YYYYMMDD is the date part of the transaction. We can create a similar table keyed by transaction_id for transaction lookup. Obsolete tables can be dropped or truncated.
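For comparison, a common alternative to one-table-per-date is to keep the date bucket inside a single table's partition key and let a table-level default TTL handle expiry. A rough sketch (the table name is just illustrative, and the default_time_to_live option requires Cassandra 2.0 or later):

```cql
-- One table for all days: partition by (date bucket, user),
-- cluster by transaction_id. Every row written to this table
-- expires automatically 30 days after it is written.
CREATE TABLE user_transactions_bucketed (
  transaction_date text,         -- date bucket, e.g. '20140131'
  user_name text,
  transaction_id text,
  transaction_time timestamp,    -- original transaction time
  transaction_type int,
  PRIMARY KEY ((transaction_date, user_name), transaction_id)
) WITH default_time_to_live = 2592000;  -- 30 days in seconds

-- Fetching a user's transactions requires supplying the date bucket:
SELECT * FROM user_transactions_bucketed
WHERE transaction_date = '20140131' AND user_name = 'some_user';
```

This keeps partitions bounded (one per user per day) and avoids both per-date tables and secondary indexes, at the cost of having to query one bucket per day of interest.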


2 Answers


Maybe you should denormalize your data model. For example, to query by user_name you can use a column family like this:

CREATE TABLE user_transactions (
  transaction_date timestamp,    //date part of transaction
  user_name text,
  transaction_id text,
  transaction_time timestamp,    //original transaction time
  transaction_type int,
  PRIMARY KEY (user_name, transaction_id)
);

So you can query using the partition key directly like this:

SELECT * FROM user_transactions WHERE user_name = 'USER_NAME';

And for lookups by ID you can use another column family (it needs a different name, since user_transactions is already taken):

CREATE TABLE transactions_by_id (
  transaction_date timestamp,    //date part of transaction
  user_name text,
  transaction_id text,
  transaction_time timestamp,    //original transaction time
  transaction_type int,
  PRIMARY KEY (transaction_id)
);

so the query could be something like this:

SELECT * FROM transactions_by_id WHERE transaction_id = 'ID';

This way you don't need any indexes.

About the TTL: maybe you could programmatically ensure that you update all the columns in a row at the same time (in the same CQL statement), so that they all expire together.
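For instance, a single INSERT with a TTL writes every column with the same timestamp and the same expiry (the values and the 30-day TTL below are just examples matching the question):

```cql
-- All columns are written in one statement, so they share one TTL
-- and will all expire together after 30 days (2592000 seconds).
INSERT INTO user_transactions
  (user_name, transaction_id, transaction_date, transaction_time, transaction_type)
VALUES
  ('some_user', 'tx-001', '2014-01-31', '2014-01-31 10:15:00', 1)
USING TTL 2592000;
```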


Perhaps my answer will be a little useful.

I would have done so:

CREATE TABLE user_transactions (
  date timestamp,
  user_name text,
  id text,
  type int,
  PRIMARY KEY (id)
);

CREATE INDEX idx_user_transactions_uname ON user_transactions (user_name);

There is no need for a 'transaction_time timestamp' column, because this time is set by Cassandra on each column and can be fetched with the WRITETIME(column_name) function. Because you write all the columns simultaneously, you can call this function on any column.

INSERT INTO user_transactions ... USING TTL 86400;

will expire all columns simultaneously, so you do not need to worry about deleting rows. See here: Expiring columns.
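To illustrate the WRITETIME point above: since all columns were written together, the write time (and the remaining TTL) can be read back from any one of them, for example:

```cql
-- WRITETIME returns the write timestamp in microseconds since epoch;
-- TTL returns the remaining seconds before the column expires.
SELECT id, WRITETIME(type), TTL(type)
FROM user_transactions
WHERE id = 'tx-001';
```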

But as far as I know, you cannot delete an entire row this way: the key column still remains, and NULL is written to the other columns. If you want to delete the rows manually, or just want an estimate of how many rows a TTL will remove, then I recommend the Astyanax driver: AllRowsReader All rows query.

And indeed, as a driver for working with Cassandra, I recommend Astyanax.