2
votes

I know there are many time-series questions on here but mine does not seem to comfortably fit with the given solutions. I am also new to Cassandra so I might be approaching this with the wrong mindset. Bear with me.

I am receiving search data in the form:

datetime_searched, term_used, product_found

and the query I would like to make:

Given a start-date and an end-date, return all term-product pairs that fall in that time window. Initially, the window will be a month long. This may (read: will) change.

For example, given the following data:

2013-11-20 00:00:00, "christmas", "decorated tree"
2014-12-01 20:00:00, "christmas", "wrapping paper"
2014-12-23 15:00:00, "christmas", "decorated tree" (duplicate term-product)

and a query for the time-range 2014-12-01 to 2015-01-01, I would like to be able to retrieve:

"christmas", "wrapping paper"
"christmas", "decorated tree"

My initial approach looked like most examples for time series data:

CREATE TABLE search_terms (
  datetime_searched timestamp,
  term_used text,
  product_found text,
  PRIMARY KEY (term_used, datetime_searched)
);

SELECT term_used, product_found
FROM search_terms 
WHERE datetime_searched > [start] 
AND datetime_searched < [end];

but this requires me to have secondary indexes and/or allow filtering, which seems to be something I should avoid if I'm only capturing a small percentage of the data being filtered.

My second idea was to create time buckets, but this solution only works if I limit the query to the buckets. It also seems to create hotspots: in my initial case, a month-long hotspot. For example, for daily buckets:

CREATE TABLE search_terms_by_day (
  datetime_searched timestamp,
  day_searched timestamp,
  term_used text,
  product_found text,
  PRIMARY KEY (day_searched, datetime_searched, term_used, product_found)
);

SELECT term_used, product_found
FROM search_terms_by_day
WHERE day_searched=[my limited query's bucket];

So what are my options? Do I constrain my requests to the bucket size, possibly creating many column families with different bucket sizes, all of which create hotspots? Am I forced to use secondary indexes? Or is there another option I am unaware of?

Thanks in advance.


1 Answer

5
votes

Writing this question has helped me sort out some of my problems. I've come up with an alternative solution which I am more-or-less happy with but will need some fine-tuning.

There is the possibility of calculating all of the time buckets we need to access, making a query for each of these buckets with a filter to grab the entries we need.

CREATE TABLE search_terms_by_day_of_year (
  day_searched int, // 1 - 366
  datetime_searched timestamp,
  term_used text,
  product_found text,
  PRIMARY KEY (day_searched, datetime_searched, term_used, product_found)
);

// Make N of these, with a different day_searched
SELECT term_used, product_found
FROM search_terms_by_day_of_year
WHERE day_searched = 51
AND datetime_searched > [start]
AND datetime_searched < [end];
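To illustrate, the partition keys for a given window can be computed by walking the dates one day at a time (a Python sketch; the function name is my own). Note that day-of-year wraps around at the year boundary:

```python
from datetime import date, timedelta

def buckets_for_range(start, end):
    """Return the day-of-year partition keys covering [start, end)."""
    days = []
    d = start
    while d < end:
        days.append(d.timetuple().tm_yday)
        d += timedelta(days=1)
    return days

# A window spanning the year boundary produces buckets from both years:
print(buckets_for_range(date(2014, 12, 30), date(2015, 1, 2)))
# [364, 365, 1]
```

(Because the bucket is only a day-of-year, bucket 51 holds February 20th of *every* year; the `datetime_searched` clustering bound is what keeps results in the intended year.)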

Positives:

  • Avoids scanning all of the search data
  • Allows smaller time buckets which in turn reduces our hot spots

Negatives:

  • Requires logic to determine the partition keys needed
  • There will be a hotspot for writes for the period of the bucket (in the above example, one day)
  • A poor choice of bucket size in relation to the query range will require looking through all buckets, negating any gains.
  • Multiple queries to the database. The smaller the bucket, the more calls needed.
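To make the last two negatives concrete, here is a sketch of the fan-out (Python; the names are mine). It builds one query per daily bucket, clamping the timestamp bounds to that bucket's day; with the DataStax Python driver the statements could then be dispatched concurrently via `execute_async`:

```python
from datetime import datetime, timedelta

CQL = ("SELECT term_used, product_found "
       "FROM search_terms_by_day_of_year "
       "WHERE day_searched = %s "
       "AND datetime_searched >= %s AND datetime_searched < %s")

def bucket_queries(start, end):
    """Build one (statement, params) pair per daily bucket in [start, end)."""
    queries = []
    day = start.date()
    while datetime.combine(day, datetime.min.time()) < end:
        lo = max(start, datetime.combine(day, datetime.min.time()))
        hi = min(end, datetime.combine(day + timedelta(days=1), datetime.min.time()))
        queries.append((CQL, (day.timetuple().tm_yday, lo, hi)))
        day += timedelta(days=1)
    return queries

# A month-long window needs 31 separate queries:
print(len(bucket_queries(datetime(2014, 12, 1), datetime(2015, 1, 1))))
# 31
```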

Please let me know if there is a better solution to this.