I'm struggling to work out the correct data model for a Cassandra column family/table which will enable me to query what I need.
I only need to store data for a single type of entity, representing a financial transaction. The main fields (applicable for queries, sorting):
- uuid: unique ID per transaction
- timestamp: time of transaction
- source_id: string (low cardinality, typically < 10 unique values across the data set)
- destination_id: string (low cardinality, typically < 10 unique values across the data set)
- merchant_id: string (high cardinality, ~100k unique values)
There are additional fields (approx 20 total), but these are not required for query or ordering.
I need to query transactions in the following two ways:
- By UUID
- By date range, for exact values of source_id/destination_id, ordered by merchant_id
The UUID query needs to be quick and, whilst the date range query should be as efficient as possible, it is run daily as an offline process so can take longer. It would be good to understand how the query time scales with no. records.
Cassandra will be setup as a cluster of 4 nodes, across 2 data centres, in case this affects the partition key configuration. I'm using Cassandra 2.0.
Being somewhat new to Cassandra, I've been reading about composite keys, timeuuid, slice queries etc but am a little confused. If anyone is able to suggest the right approach, or point me in the right direction, I'd be very grateful.
Thanks.