4 votes

I currently have an application that persists event-driven, real-time streaming data to a column family modeled as follows:

CREATE TABLE current_data (
    account_id text, 
    value text,
    PRIMARY KEY (account_id)
)

Data is sent every X seconds per account_id, so we overwrite the existing row every time we receive an event. This data contains current real-time information, and we only care about the most recent event (there is no use for older data, which is why we insert over an already existing key). On the application side, we query with a SELECT by account_id.
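For illustration, a write and read against this table might look like the sketch below (the account id 'acct-1' and the value are made-up placeholders); re-inserting the same account_id is simply an upsert:

-- overwrite the current value for this account (upsert on the partition key)
INSERT INTO current_data (account_id, value) VALUES ('acct-1', '42.7');

-- application-side read of the latest value
SELECT value FROM current_data WHERE account_id = 'acct-1';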

I was wondering whether there is a better way to model this behaviour, and I was looking at Cassandra's best practices and similar questions (How to model Cassandra DB for Time Series, server metrics).

I thought about something like this:

CREATE TABLE current_data_2 (
    account_id text,
    time timeuuid,
    value text,
    PRIMARY KEY (account_id, time)
) WITH CLUSTERING ORDER BY (time DESC)

No overwrites will occur, and each insert will also be done with a TTL (it can be a TTL of a few minutes).
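A hypothetical insert for this model could look like the following sketch, assuming a 5-minute TTL and using CQL's now() to generate the timeuuid (the account id and value are again placeholders):

-- append a new event; it expires on its own after 300 seconds
INSERT INTO current_data_2 (account_id, time, value)
VALUES ('acct-1', now(), '42.7')
USING TTL 300;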

The question is how much better, if at all, the second data model is compared to the first one. From what I understand, the main advantage will be in the READS - since the data is ordered by time, all I need to do is a simple

SELECT * FROM current_data_2 WHERE account_id = <id> LIMIT 1

while in the first data model Cassandra actually reads ALL rows that were written to the same key and then chooses the last one by its write timestamp (please correct me if I'm wrong).

Thanks.

1
If you only need the most recent event, then the first one is the best. – Ashraful Islam
Could you be more specific, please? Isn't the second one preferable on reads? – fncontroloptioncommand

1 Answer

2 votes

First of all, I encourage you to examine the official documentation about the read path.

data is ordered by time

This is only true in your second case, when Cassandra reads a single SSTable and MemTable (check the flow diagram).

Cassandra actually reads ALL rows that where overwritten the same key and then chooses the last one by its write timestamp

This happens at the Merge Cells by Timestamp step in the documentation (again, check the flow diagram). Notice that in your first case each SSTable will contain only one row for a given key.
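As a side note, you can see the timestamp Cassandra uses for this reconciliation with the writetime() function (the account id below is a placeholder):

-- shows the write timestamp that wins the merge for this cell
SELECT value, writetime(value) FROM current_data WHERE account_id = 'acct-1';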

In both of your cases the main driving factor is how many SSTables have to be checked during a read. This is somewhat independent of how many records each SSTable contains.

But in the second case you have much bigger SSTables, which leads to longer SSTable compaction. Also, TTL expiration performs additional writes. So the first case is somewhat preferable.
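If you do end up needing the time-series shape anyway, one possible mitigation (a sketch only, assuming a Cassandra version that ships TimeWindowCompactionStrategy, i.e. 3.0.8/3.8+) is to set the TTL and a time-window compaction strategy at the table level, so expired data tends to be dropped with whole SSTables rather than cell by cell:

CREATE TABLE current_data_2 (
    account_id text,
    time timeuuid,
    value text,
    PRIMARY KEY (account_id, time)
) WITH CLUSTERING ORDER BY (time DESC)
  AND default_time_to_live = 300
  AND compaction = {'class': 'TimeWindowCompactionStrategy',
                    'compaction_window_unit': 'MINUTES',
                    'compaction_window_size': 10};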