Our use case with Cassandra is to show top 10 recent visitors of a blogpost. Following is the Cassandra table definition
CREATE TABLE blogs_by_visitor (
blogposturl text,
visitor text,
visited_ts timestamp,
PRIMARY KEY (blogposturl, visitor)
);
Now in order to show top 10 recent visitors for a given blogpost, there needs to be an explicit "order by" clause on timestamp desc. Since visted_ts isn't part of the clustering column in Cassandra, we aren't able to get this done. The reason for visited_ts not being part of clustering column is to avoid recording repeat (read as duplicate) visitors. The primary key is designed in such a way to upsert the latest timestamp for a repeat visitor.
In RDBMS world the query would look like the following and a secondary index could be created with blogposturl and timestamp columns.
Select visitor from blog_table
where
blogposturl = ?
and rownum <= 10
order by timestamp desc
An alternative currently being followed in our Cassandra application, is to obtain the results and then sort based on timestamp on the app side. But what if a particular blogpost becomes so popular and it had more than 100,000 visitors. The query becomes really slow for those blogs.
I'm thinking secondary index wouldn't be useful here, as I don't worry about filtering on it (rather just for sorting - which isn't possible).
Any idea on how we could model the table differently?
The actual table has additional columns, reduced it here for simplicity