2
votes

I know there are many time-series questions on here but mine does not seem to comfortably fit with the given solutions. I am also new to Cassandra so I might be approaching this with the wrong mindset. Bear with me.

I am receiving search data in the form:

datetime_searched, term_used, product_found

and the query I would like to make:

Given a start-date and an end-date, return all term-product pairs that fall in that time window. Initially, the window will be a month long. This may (read: will) change.

For example, given the following data:

2013-11-20 00:00:00, "christmas", "decorated tree"
2014-12-01 20:00:00, "christmas", "wrapping paper"
2014-12-23 15:00:00, "christmas", "decorated tree" (duplicate term-product)

and a query for the time-range 2014-12-01 to 2015-01-01, I would like to be able to retrieve:

"christmas", "wrapping paper"
"christmas", "decorated tree"

My initial approach looked like most examples for time series data:

CREATE TABLE search_terms (
  datetime_searched timestamp,
  term_used text,
  product_found text,
  PRIMARY KEY (term_used, datetime_searched)
);

SELECT term_used, product_found
FROM search_terms 
WHERE datetime_searched > [start] 
AND datetime_searched < [end];

but this requires me to have secondary indexes and/or allow filtering, which seems to be something I should avoid if I'm only capturing a small percentage of the data being filtered.

My second idea was to create time buckets, but this solution only works if I limit the query to the buckets. It also seems to create hotspots: in my initial case, a month-long hotspot. For example, for daily buckets:

CREATE TABLE search_terms_by_day (
  datetime_searched timestamp,
  day_searched timestamp,
  term_used text,
  product_found text,
  PRIMARY KEY (day_searched, datetime_searched, term_used, product_found)
);

SELECT term_used, product_found
FROM search_terms_by_day
WHERE day_searched=[my limited query's bucket];

So what are my options? Do I constrain my requests to the bucket size, possibly creating many column families with different bucket sizes, all of which create hotspots? Am I forced to use secondary indexes? Or is there another option I am unaware of?

Thanks in advance.


1 Answer

5
votes

Writing this question has helped me sort out some of my problems. I've come up with an alternative solution which I am more-or-less happy with but will need some fine-tuning.

There is the possibility of calculating all of the time buckets we need to access, making a query for each of these buckets with a filter to grab the entries we need.

CREATE TABLE search_terms_by_day_of_year (
  day_searched int, // 1 - 366
  datetime_searched timestamp,
  term_used text,
  product_found text,
  PRIMARY KEY (day_searched, datetime_searched, term_used, product_found)
);

// Make N of these, with a different day_searched
SELECT term_used, product_found
FROM search_terms_by_day_of_year
WHERE day_searched = 51
AND datetime_searched > [start]
AND datetime_searched < [end];
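To illustrate, the partition keys for a given window can be computed by walking the dates one day at a time (a Python sketch; the function name is my own). Note that day-of-year wraps around at the year boundary:

```python
from datetime import date, timedelta

def buckets_for_range(start, end):
    """Return the day-of-year partition keys covering [start, end)."""
    days = []
    d = start
    while d < end:
        days.append(d.timetuple().tm_yday)
        d += timedelta(days=1)
    return days

# A window spanning the year boundary produces buckets from both years:
print(buckets_for_range(date(2014, 12, 30), date(2015, 1, 2)))
# [364, 365, 1]
```

(Because the bucket is only a day-of-year, bucket 51 holds February 20th of *every* year; the `datetime_searched` clustering bound is what keeps results in the intended year.)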

Positives:

  • Avoids scanning all of the search data
  • Allows smaller time buckets which in turn reduces our hot spots

Negatives:

  • Requires logic to determine the partition keys needed
  • There will be a hotspot for writes for the period of the bucket (in the above example, one day)
  • A poor choice of bucket size in relation to the query range will require looking through all buckets, negating any gains.
  • Multiple queries to the database. The smaller the bucket, the more calls needed.
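To make the last two negatives concrete, here is a sketch of the fan-out (Python; the names are mine). It builds one query per daily bucket, clamping the timestamp bounds to that bucket's day; with the DataStax Python driver the statements could then be dispatched concurrently via `execute_async`:

```python
from datetime import datetime, timedelta

CQL = ("SELECT term_used, product_found "
       "FROM search_terms_by_day_of_year "
       "WHERE day_searched = %s "
       "AND datetime_searched >= %s AND datetime_searched < %s")

def bucket_queries(start, end):
    """Build one (statement, params) pair per daily bucket in [start, end)."""
    queries = []
    day = start.date()
    while datetime.combine(day, datetime.min.time()) < end:
        lo = max(start, datetime.combine(day, datetime.min.time()))
        hi = min(end, datetime.combine(day + timedelta(days=1), datetime.min.time()))
        queries.append((CQL, (day.timetuple().tm_yday, lo, hi)))
        day += timedelta(days=1)
    return queries

# A month-long window needs 31 separate queries:
print(len(bucket_queries(datetime(2014, 12, 1), datetime(2015, 1, 1))))
# 31
```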

Please let me know if there is a better solution to this.