Cassandra filtering by date with a secondary index

Question

I have a requirement to answer the following queries:

return the number of new customers per quarter (up to 36 months)
list the new customers per quarter (up to 36 months)

I've created the following table in Cassandra to deal with this:

CREATE TABLE first_purchase_by_shopper_date
(
    shop_id                     uuid,
    shopper_id                  uuid,
    dt_first_purchase           timestamp,

    ... (some text fields)

    PRIMARY KEY ((shop_id, shopper_id))
);

In order to be able to answer this query in Cassandra, I need to be able to filter this data on the dt_first_purchase field.

But if I add dt_first_purchase to the primary key, then it makes the row non-unique to a shopper - and therefore we get multiple entries in the table - but we only ever want one entry per shopper.

so my insert statement would look like

Insert into first first_purchase_by_shopper_date (shop_id, shopper_id, dt_first_purchase, ... ) Values(...) If Not Exists;

The if not exists at the end ensures that the entry is only written if none exists already (e.g. no update is performed on an existing record.)

How can I filter by date on this table - is a secondary index on the dt_first_purchase column my only option - and isn't this undesirable?

Aaron Aaron · Accepted Answer · 2014-12-18T16:07:36

How can I filter by date on this table - is a secondary index on the dt_first_purchase column my only option - and isn't this undesirable?

You could certainly try a secondary index on dt_first_purchase (and querying by range on that would also require the use of the ALLOW FILTERING directive). For performance (especially with a large cluster), I do not recommend that.

But first and foremost, understand that Cassandra is designed around returning data for a specific key on a specific data partition. This means that the best way for you to query your data by a date range, is to first partition your data by a key that makes sense for your model. For instance, if you had a primary key defined like this:

PRIMARY KEY ((shop_id), dt_first_purchase, shopper_id)

basically, a record of which shop (shop_id) recorded a first purchase (dt_first_purchase) for a particular shopper (shopper_id)

With your data partitioned by shop (shop_id) you could then query the first purchases of new shoppers for a particular shop_id like this:

aploetz@cqlsh:stackoverflow> SELECT * 
FROM first_purchase_by_shopper_date 
WHERE shop_id=ce1089f6-c613-4d5b-a975-5dfd677b46f9 
AND dt_first_purchase >= '2014-01-01 00:00:00' 
AND dt_first_purchase < '2014-04-01 00:00:00';

 shop_id                              | dt_first_purchase        | shopper_id                           | value
--------------------------------------+--------------------------+--------------------------------------+-------
 ce1089f6-c613-4d5b-a975-5dfd677b46f9 | 2014-02-12 18:33:22-0600 | a7480417-aaf8-42b1-85dd-5d9a4a30c204 | shopper1
 ce1089f6-c613-4d5b-a975-5dfd677b46f9 | 2014-03-13 11:33:22-0500 | 07db2b71-2dc7-421d-bf73-82a5f6c55f89 | shopper2

(2 rows)

Additionally, you could then count the number of first purchases (new shoppers) for that particular shop and date range, like this:

aploetz@cqlsh:stackoverflow> SELECT COUNT(*) FROM first_purchase_by_shopper_date
WHERE shop_id=ce1089f6-c613-4d5b-a975-5dfd677b46f9 
AND dt_first_purchase >= '2014-01-01 00:00:00' 
AND dt_first_purchase < '2014-04-01 00:00:00';

 count
-------
     2

(1 rows)

Please note that this specific example may not work for your use case. Take it for what it is: a demonstration of how partitioning and querying work in Cassandra.

For more information, check out Patrick McFadin's article on Getting Started With Time Series Data Modeling. He discusses ways to solve a use case that is similar to yours.

Cassandra filtering by date with a secondary index

1 Answers