0
votes

Cassandra Newbie here. Cassandra v 3.9.

I'm modelling the Travellers Flight Checkin Data.

My Main Query Criteria is Search for travellers with a date range (max of 7 day window).

Here is what I've come up with with my limited exposure to Cassandra.

create table IF NOT EXISTS travellers_checkin (checkinDay text, checkinTimestamp bigint, travellerName text, travellerPassportNo text, flightNumber text, from text, to text, bookingClass text, PRIMARY KEY (checkinDay, checkinTimestamp)) WITH CLUSTERING ORDER BY (checkinTimestamp DESC)

Per day, I'm expecting upto a million records - resulting in the partition to have a million records.

Now my users want search in which the date window is mandatory (max a week window). In this case should I use a IN clause that spans across multiple partitions? Is this the correct way or should I think of re-modelling the data? Alternatively, I'm also wondering if issuing 7 queries (per day) and merging the responses would be efficient.

1
Uh-oh. You will put all checkins per day on a single partition - this will become a temporary hotspot in your cluster and not scale very well. Any restriction you could put into the parititon key - maybe flight number as those should be pre defined? - Mandraenke
Merging your result on client side form a bunch of async smaller queries will give good performance as work will be distributed around your cluster. - Mandraenke
I can add more fields to the Partition Key to thin down the rows. But in the worst case, the users might simply search by not giving any filters on the UI, which puts me back in the original problem space. Only the date range is mandatory by default. 1M per partition is what the business foresees. Is this too big a number to become a hotspot? - user1189332
The hotspot arises from the fact that choosing checkinDay will put all data to a single node (and it's replicas) and not spread around your cluster. Throughout the day all writes and reads will hit the same nodes, while the others are idling. So adding the flightnumber (if that is predictable) will distribute read and writes around your cluster through a single day - if you have no restriction of course you will need to query all flight numbers - that can be done async. Then merge the result in your code - that often will be faster. Just try it out with some mock and test cases. - Mandraenke
Cheers! Makes it clear. Will try out a few cases and benchmark the performance. Coming from an RDBMS world, the whole notion of data duplication (according to the read patterns), knowing the query patterns beforehand seems a brain-twister to me. ;-) - user1189332

1 Answers

3
votes

Your Data Model Seems Good.But If you could add more field to the partition key it will scale well. And you should use Separate Query with executeAsync

If you are using in clause, this means that you’re waiting on this single coordinator node to give you a response, it’s keeping all those queries and their responses in the heap, and if one of those queries fails, or the coordinator fails, you have to retry the whole thing

enter image description here

Source : https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/

Instead of using IN clause, use separate query of each day and execute it with executeAsync.

Java Example :

PreparedStatement statement = session.prepare("SELECT * FROM travellers_checkin where checkinDay = ? and checkinTimestamp >= ? and checkinTimestamp <= ?");

List<ResultSetFuture> futures = new ArrayList<>();
for (int i = 1; i < 4; i++) {
    ResultSetFuture resultSetFuture = session.executeAsync(statement.bind(i, i));
    futures.add(resultSetFuture);
}

for (ResultSetFuture future : futures){
     ResultSet rows = future.getUninterruptibly();
     //You get the result set of each query, merge them here
}