1
votes

Our use case with Cassandra is to show top 10 recent visitors of a blogpost. Following is the Cassandra table definition

CREATE TABLE blogs_by_visitor (
             blogposturl text,
             visitor text,
             visited_ts timestamp,
             PRIMARY KEY (blogposturl, visitor)
           );

Now in order to show top 10 recent visitors for a given blogpost, there needs to be an explicit "order by" clause on timestamp desc. Since visted_ts isn't part of the clustering column in Cassandra, we aren't able to get this done. The reason for visited_ts not being part of clustering column is to avoid recording repeat (read as duplicate) visitors. The primary key is designed in such a way to upsert the latest timestamp for a repeat visitor.

In RDBMS world the query would look like the following and a secondary index could be created with blogposturl and timestamp columns.

Select visitor from blog_table
where 
blogposturl = ?
and rownum <= 10
order by timestamp desc

An alternative currently being followed in our Cassandra application, is to obtain the results and then sort based on timestamp on the app side. But what if a particular blogpost becomes so popular and it had more than 100,000 visitors. The query becomes really slow for those blogs.

I'm thinking secondary index wouldn't be useful here, as I don't worry about filtering on it (rather just for sorting - which isn't possible).

Any idea on how we could model the table differently?

The actual table has additional columns, reduced it here for simplicity

1

1 Answers

2
votes

These type of job are done by Apache Spark or Hadoop. A schedule job which compute the unique visitor order by timestamp for each url and store the result into cassandra.

Or you can create a Materialized View on top of the blogs_by_visitor. This table will make sure of unique visitor and the materialized view will oder the result based on visited_ts timestamp.

Let's create the Materialized View :

CREATE MATERIALIZED VIEW unique_visitor AS
    SELECT *
    FROM blogs_by_visitor
    WHERE blogposturl IS NOT NULL AND visitor IS NOT NULL AND visited_ts IS NOT NULL
    PRIMARY KEY (blogposturl, visited_ts, visitor)
    WITH CLUSTERING ORDER BY (visited_ts DESC, visitor ASC);

Now you can just select the 10 recent unique visitor of a blogpost.

SELECT * FROM unique_visitor WHERE blogposturl = ? LIMIT 10;

you can see that i haven't specify the sort order in select query. Because in the materialized view schema a have specified default sort order visited_ts DESC

Note That : The above schema will result huge amount of unexpected tombstone generation in the Materialized Views

Or You could change your table schmea like below :

CREATE TABLE blogs_by_visitor (
     blogposturl text,
     year int,
     month int,
     day int,
     visitor text,
     visited_ts timestamp,
     PRIMARY KEY ((blogposturl, year, month, day), visitor)
);

Now you have only a small amount of data in a single partition.So you can sort all the visitor based on visited_ts in that single partition from the client side. If you think number of visitor in a day can be huge then add hour to the partition key also.