So we have a web application using .NET with Cassandra / Spark combo to produce online reports.
Currently we grab all relevant data from Cassandra and render it inside a table through a JavaScript plugin that also sorts it (depending on column clicked).
E.g.
PK = PARTITION KEY | CK = CLUSTERING KEY
PK PK CK
-------------------------------------
| user | date | application | time |
-------------------------------------
| A | 17500 | app1 | 3000 |
| A | 17500 | calc | 1000 |
| A | 17500 | word | 5000 |
-------------------------------------
However the data coming back is becoming increasingly larger: so we needed to develop some sort of pagination to avoid long request and front-end loading times.
The column most likely users would sort by is time and unfortunately is not part of the Clustering key and therefore cannot use the ORDER BY
command.
A solution we came up with was creating a 'ranking' table with the same data E.g.
PK PK CK
--------------------------------------------
| user | date | rank | application | time |
--------------------------------------------
| A | 17500 | 1 | word | 5000 |
| A | 17500 | 2 | app1 | 3000 |
| A | 17500 | 3 | calc | 1000 |
--------------------------------------------
...but this would put a lot more load onto Spark as the data gathered for 'time' is constantly incrementing and therefore changing the rank.
We could also order the results server-side, cache and retrieve limited data through ajax calls, but this method significantly increases memory load on the server (especially if many users are using the system at once).
Perhaps I'm overthinking this and there is a simple cassandra table construction could be used instead. What would be the best way to solve this problem?
EDIT (15th Dec 2017): Came accross something in Cassandra called Materialized Views which seems to be able to order non-keyed columns as clustering keys. This would be great for grabbing top number of rows and sorting but not pagination.
EDIT (18th Dec 2017): The Datastax C# driver allows for pagination of results returned. The paging state can be saved and continued when needed. This together with the Materialized views would complete pagination.
EDIT (19th Dec 2017)
Haven't really delved into the pit of dataframes through spark - once setup they are incredibly fast to sort and filter on - treating it like SQL.
Key words: once setup. Found they took an average of around 7 seconds to create.
EDIT (29 Mar 2018)
Hit a snag with the current solution (Materialized View + limit results). The Materialized View needs to constantly update resulting in craploads of tombstones. This means: bad cluster performance.
See Sorting Results by Non-Clustering Key and Tombstones When Updating.
Back to square 1. sigh
EDIT (22 Aug 2018) Through vigorous research: it appears the way to go is implementing a Solr solution. Solr allows for powerful & fast indexed searches as well as paging. This blog post 'Avoid pitfalls in scaling Cassandra' is a good resource from Walmart's dev that explains a solution on how they did paging using 'sharding'.