
So we have a web application using .NET with a Cassandra/Spark combo to produce online reports.

Currently we grab all relevant data from Cassandra and render it in a table through a JavaScript plugin that also sorts it (depending on the column clicked).

E.g.

PK = PARTITION KEY | CK = CLUSTERING KEY

   PK     PK         CK
-------------------------------------
| user | date  | application | time |
-------------------------------------
| A    | 17500 | app1        | 3000 |
| A    | 17500 | calc        | 1000 |
| A    | 17500 | word        | 5000 |
-------------------------------------

However, the data coming back is becoming increasingly large, so we needed to develop some form of pagination to avoid long request and front-end loading times.
The column users are most likely to sort by is time, which unfortunately is not part of the clustering key, so we cannot use the ORDER BY clause on it.
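For context, here is a sketch of the kind of schema described above (column names from the example; the types are assumptions) and the query that Cassandra rejects:

```sql
-- Assumed definition of the current table
CREATE TABLE app_usage (
    user        text,
    date        int,
    application text,
    time        int,
    PRIMARY KEY ((user, date), application)
);

-- Works: ordering by the clustering key
SELECT * FROM app_usage
WHERE user = 'A' AND date = 17500
ORDER BY application ASC;

-- Fails: time is not a clustering column, so ORDER BY is not allowed on it
SELECT * FROM app_usage
WHERE user = 'A' AND date = 17500
ORDER BY time DESC;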

A solution we came up with was creating a 'ranking' table with the same data E.g.

   PK     PK      CK
--------------------------------------------
| user | date  | rank | application | time |
--------------------------------------------
| A    | 17500 | 1    | word        | 5000 |
| A    | 17500 | 2    | app1        | 3000 |
| A    | 17500 | 3    | calc        | 1000 |
--------------------------------------------

...but this would put a lot more load onto Spark, as the data gathered for 'time' is constantly incrementing, and therefore the ranks keep changing.
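For reference, the ranking table above could be defined like this (types assumed), which makes paging a cheap clustering-key slice at the cost of constantly recomputing ranks:

```sql
-- Assumed definition of the 'ranking' table
CREATE TABLE app_usage_ranked (
    user        text,
    date        int,
    rank        int,
    application text,
    time        int,
    PRIMARY KEY ((user, date), rank)
);

-- Paging then becomes a simple range on the clustering key,
-- e.g. the second page of 10 rows:
SELECT * FROM app_usage_ranked
WHERE user = 'A' AND date = 17500 AND rank > 10 AND rank <= 20;
```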

We could also order the results server-side, cache them, and retrieve limited chunks through AJAX calls, but this method significantly increases the memory load on the server (especially when many users are using the system at once).

Perhaps I'm overthinking this and there is a simple Cassandra table construction that could be used instead. What would be the best way to solve this problem?


EDIT (15th Dec 2017): Came across something in Cassandra called Materialized Views, which seems able to order non-key columns as clustering keys. This would be great for grabbing the top N rows and sorting, but not for pagination.
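A sketch of what such a view might look like against the assumed table above (Cassandra requires every base primary-key column in the view key, plus at most one non-key column — here, time):

```sql
CREATE MATERIALIZED VIEW app_usage_by_time AS
    SELECT * FROM app_usage
    WHERE user IS NOT NULL AND date IS NOT NULL
      AND application IS NOT NULL AND time IS NOT NULL
    PRIMARY KEY ((user, date), time, application)
    WITH CLUSTERING ORDER BY (time DESC);

-- Grabbing the top N rows is now trivial:
SELECT * FROM app_usage_by_time
WHERE user = 'A' AND date = 17500
LIMIT 10;
```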


EDIT (18th Dec 2017): The DataStax C# driver allows for pagination of returned results. The paging state can be saved and resumed when needed. Together with the Materialized Views, this would complete pagination.
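A minimal sketch of manual paging with the DataStax C# driver, assuming an already-connected ISession and the hypothetical view name from the example above:

```csharp
using System;
using Cassandra;

public static class Paging
{
    // Fetch one page and return the paging state to resume from next time.
    public static byte[] FetchPage(ISession session, byte[] savedState)
    {
        var statement = new SimpleStatement(
                "SELECT application, time FROM app_usage_by_time " +
                "WHERE user = ? AND date = ?", "A", 17500)
            .SetPageSize(25)              // rows per page
            .SetAutoPage(false)           // manual paging: fetch one page only
            .SetPagingState(savedState);  // null requests the first page

        RowSet rs = session.Execute(statement);
        foreach (Row row in rs)
            Console.WriteLine($"{row.GetValue<string>("application")}: " +
                              $"{row.GetValue<int>("time")}");

        // Null when there are no more pages; otherwise serialize it (e.g. Base64)
        // and hand it back so the next AJAX call can resume where this left off.
        return rs.PagingState;
    }
}
```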


EDIT (19th Dec 2017): Haven't really delved into the pit of DataFrames through Spark - once set up, they are incredibly fast to sort and filter on, treating the data like SQL.
Key words: once set up. Found they took an average of around 7 seconds to create.


EDIT (29 Mar 2018): Hit a snag with the current solution (Materialized View + limited results). The Materialized View needs to update constantly, resulting in craploads of tombstones. This means: bad cluster performance.
See Sorting Results by Non-Clustering Key and Tombstones When Updating.
Back to square 1. sigh


EDIT (22 Aug 2018): Through rigorous research, it appears the way to go is implementing a Solr solution. Solr allows for powerful and fast indexed searches as well as paging. The blog post 'Avoid pitfalls in scaling Cassandra', from a Walmart developer, is a good resource that explains how they implemented paging using 'sharding'.

1 Answer


It's been a good while since I asked this question, but I wanted to post some information about the current solution.

Partition keys are 'key'.

Design the database so that only the data you want returned is returned.
Filtering on the exact partition key, instead of also filtering on clustering keys, improved the cluster's performance tremendously. We now use a single table with a single partition key instead of hundreds of tables with composite keys. Sharding was also implemented.
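The answer doesn't show the actual schema, but the shape might look something like this - one table whose partition key combines user, date and a shard id, read with a single exact-key query (all names and the key format here are guesses):

```sql
CREATE TABLE report_data (
    report_key  text,   -- hypothetical combined key, e.g. 'A:17500:0'
    application text,
    time        int,
    PRIMARY KEY (report_key, application)
);

-- One exact-partition read per shard, no clustering-key filtering:
SELECT * FROM report_data WHERE report_key = 'A:17500:0';
```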

Kafka Data Streaming & Caching

One of the largest pitfalls we faced was the huge amount of pressure on the database from our constantly updating data, which often inserted duplicate rows. This created problems with memtable sizes and flush times, which often saw nodes falling over. https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/dml/dmlHowDataWritten.html
So we decided to change to streaming instead of batch processing (Spark).

Kafka streaming is so fast that no Cassandra querying is done until topics no longer need to be kept in memory. Optimised Kafka topics stream into an intermediate caching system, which sorts the data using Linq (C#) and keeps it there until a certain period of time has passed. Data is retrieved from this cache for paging.
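The cache-side sorting and paging could be sketched like this - the row type and helper names are illustrative, since the answer doesn't show the real cache layer:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row type mirroring the table from the question.
public record UsageRow(string User, int Date, string Application, int Time);

public static class ReportCache
{
    // Sort in memory on the non-key 'Time' column, then slice out one page.
    public static List<UsageRow> GetPage(
        IEnumerable<UsageRow> cached, int page, int pageSize) =>
        cached
            .OrderByDescending(r => r.Time)
            .Skip(page * pageSize)
            .Take(pageSize)
            .ToList();
}
```

Since the sort happens against the in-memory cache rather than Cassandra, the non-key column restriction on ORDER BY no longer matters.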

Spark streaming would have also worked for us, but Kafka just fit better. Here's a good article on the difference and what might be better for you:
https://www.knowledgehut.com/blog/big-data/kafka-vs-spark