9
votes

I am considering Cassandra as an intermediate storage during my ETL job to perform data deduplication.

Let's imagine I have a stream of events, each of them have some business entity id, timestamp and some value. I need to get only latest value in terms of in-event timestamp for each business key, but events may come unordered.

My idea was to create staging table with business id as a partition key and timestamp as a clustering key:

CREATE TABLE sample_keyspace.table1_copy1 (
 id uuid,
 time timestamp,
 value text,
 PRIMARY KEY (id, time)
) WITH CLUSTERING ORDER BY ( time DESC )

Now if I insert some data in this table I can get latest value for some given partition key:

select * from table1 where id = 96b29b4b-b60b-4be9-9fa3-efa903511f2d limit 1;

But that would require to issue such query for every business key I'm interested in.

Is there some effective way I could do it in CQL?

I know we have an ability to list all available partition keys (by select distinct id from table1). So if I look into storage model of Cassandra, getting first row for each partition key should not be too hard.

Is that supported?

2

2 Answers

21
votes

If you're using a version after 3.6, there is an option on your query named PER PARTITION LIMIT (CASSANDRA-7017) which you can set to 1. This won't auto complete in cqlsh until 3.10 with CASSANDRA-12803.

SELECT * FROM table1 PER PARTITION LIMIT 1;
0
votes

In a word: no.

The partitioning key is why Cassandra can work essentially any amount of data: It decides where to put/look for data using the hash of the partitioning key. That is why CQL SELECTs always need to do an equality filter on the entire partitioning key. In order to find the first time for each id, Cassandra would have to ask all nodes for any partition of the data, then perform a complex operation on each of them. Relational databases allow this, Cassandra does not. All it allows are full table scans (SELECT * from table1), or partition scans (SELECT DISTINCT id FROM table1), but those cannot* be linked to any complex operation.

*) I am omitting ALLOW FILTERING here, since it does not help in this context.