
I'm struggling to work out the correct data model for a Cassandra column family/table which will enable me to query what I need.

I only need to store data for a single type of entity, representing a financial transaction. The main fields (those relevant for querying and sorting) are:

  1. uuid: unique ID per transaction
  2. timestamp: time of transaction
  3. source_id: string (low cardinality, typically < 10 unique values across the data set)
  4. destination_id: string (low cardinality, typically < 10 unique values across the data set)
  5. merchant_id: string (high cardinality, ~100k unique values)

There are additional fields (roughly 20 in total), but these are not required for querying or ordering.

I need to query transactions in the following two ways:

  1. By UUID
  2. By date range, for exact values of source_id/destination_id, ordered by merchant_id

The UUID query needs to be quick. The date range query should be as efficient as possible, but it is run daily as an offline process, so it can take longer. It would also be good to understand how the query time scales with the number of records.

Cassandra will be set up as a cluster of 4 nodes across 2 data centres, in case this affects the partition key configuration. I'm using Cassandra 2.0.

Being somewhat new to Cassandra, I've been reading about composite keys, timeuuid, slice queries, etc., but am a little confused. If anyone is able to suggest the right approach, or point me in the right direction, I'd be very grateful.

Thanks.


1 Answer


Query 1 is fairly straightforward; in CQL3, it looks like:

create table tx_by_uuid (
  tx_id uuid PRIMARY KEY,
  tx_time timestamp,
  source_id text,
  dest_id text,
  merchant_id text
);
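
A lookup by UUID is then a single-partition read; for example (the UUID value here is just a placeholder):

select * from tx_by_uuid where tx_id = 123e4567-e89b-12d3-a456-426655440000;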

Query 2 is a bit more complex. (Note: I'm assuming here that source_id/destination_id refers to a requirement that both are specified, not an either/or situation.)

Since we want to query by exact values of source_id/destination_id, we ideally want a primary key that uses a combination of those columns as our partition key. Since those are low-cardinality columns, we might run into a problem with lumpy partitioning. However, given the extra information about the number of nodes, we can convince ourselves that source_id+destination_id is an acceptable partition key for this table: if each of those columns has about 10 distinct values, their combination yields about 100 distinct partitions. In a larger cluster this would not be ideal data distribution, and we might want to add a temporal bucket to the partition key, but for a four-node cluster we should be alright here.

Since we also want to query by date range, we will want to include the tx_time column as a clustering column. In addition, since we want the results ordered by merchant, merchant_id will also need to be a clustering column. Finally, to ensure that the primary key uniquely identifies the individual transactions, tx_id should be the last clustering column. So the table for query 2 might look like:

create table tx_by_merchant (
  tx_id uuid,
  tx_time timestamp,
  source_id text,
  dest_id text,
  merchant_id text,
  primary key ((source_id, dest_id), tx_time, merchant_id, tx_id)
);
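
With that table, the daily query fully specifies the partition key and slices on tx_time, along these lines (the source/destination values and dates are placeholders):

select * from tx_by_merchant
where source_id = 'src-1' and dest_id = 'dst-1'
  and tx_time >= '2014-06-01' and tx_time < '2014-06-02';

Because the partition key is fully specified, this reads a contiguous slice of a single partition, which is the access pattern Cassandra handles best.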

Note that these partitions could become extremely wide, so it might be a good idea to add a supplementary year+month column to the partition key to break them up a bit. Also, if you need to reverse the ordering, have a look at the WITH CLUSTERING ORDER BY clause.
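
For illustration, a bucketed variant might look like the following; the month column and its format are an assumption, and the application would populate it at write time from tx_time:

create table tx_by_merchant_bucketed (
  tx_id uuid,
  tx_time timestamp,
  source_id text,
  dest_id text,
  merchant_id text,
  month text, -- e.g. '2014-06'; hypothetical bucket derived from tx_time
  primary key ((source_id, dest_id, month), tx_time, merchant_id, tx_id)
) with clustering order by (tx_time desc);

The daily query would then name the month bucket alongside source_id and dest_id, and the clustering order clause returns the newest transactions first.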

Queries that retrieve a row uniquely identified by a single primary key are very fast. Queries that fetch from very wide partitions can be somewhat slower, but they will still be faster than trying to accomplish the same thing with a secondary index. Broadly, the cost of the slice query scales with the number of rows read from the partition rather than with the total size of the table.