0
votes

I want to store info about some events in Cassandra. Events have different groups and also grouped by time interval (group id = partition key, interval = clustering key). Events has id and inside every group I want to store only events with unique id inside this group. I think to use sets for it and store event id in them. Something like this:

group id (PK) | time (CK) | event ids
1             | 13:00     | {0, 2, 4, 5}
1             | 14:00     | {1, 3}
1             | 15:00     | {}
2             | 13:00     | {}
2             | 14:00     | {2, 4}

When I do select request I want to get events count for special group inside some time range. It will be next for table above and group with id 1 for time range 13:00 - 15:00:

13:00 - 4
14:00 - 2
15:00 - 0

I can select all events sets for group 1 for time range 13:00 - 15:00 and calculate their side. It will works but events set can be large enough and I don't need info about event ids (I store it only for uniqueness), only their size. Can I get sets sizes on Cassandra side using CQL?

1

1 Answers

1
votes

Don't use collection for huge data

Collection (Set): collection size: 2B (231); values size: 65535 (216-1) (Cassandra 2.1 and later, using native protocol v3)

Instead put event_id in the primary key.

CREATE TABLE events(
    group_id bigint,
    time bigint,
    event_id bigint,
    PRIMARY KEY(group_id,time,event_id)
);

You can insert data like this one :

INSERT INTO events (group_id , time , event_id ) VALUES ( 1, 13, 0);

And you can query like this one :

SELECT * FROM events WHERE group_id = 1;

It will return all the event in a group.

group_id | time | event_id
----------+------+----------
        1 |   13 |        0
        1 |   13 |        1
        1 |   14 |        2

Use Spark or Write program to Find the group by count.

Or use any one of these query to get count.

SELECT group_id,time,count(*) FROM events WHERE group_id = 1 AND time = 13; // To count in a group and time
SELECT group_id,time,count(*) FROM events WHERE group_id = 1 AND time >= 13 AND time <= 14; // To count in a group between time 13 to 14.

Source : https://docs.datastax.com/en/cql/3.1/cql/cql_reference/refLimits.html