
I want to store massive amounts of time series data as efficiently as possible. Speed is important, but not as important as storage.

My data consists of the name of a stock, followed by 1-minute data for 15 years. The data begins precisely on Jan 1, 2000, and the number of minutes each day is precisely 390.

So I don't need to store the timestamp of each value, because I can calculate it from the value's position (the i-th value for a stock falls on trading day i / 390 after Jan 1, 2000, at minute i mod 390 within that day).
So instead of this:

Apple [timeStamp:value][timeStamp:value]

I want this:

Apple [value][value]

Is there a way to load this sort of data into Cassandra so that it only stores the sequential values, and not a timestamp for each value?

Presumably, storing a timestamp with each value would double the storage required: if each timestamp and each value is 8 bytes, the data would take up 50 terabytes instead of the 25 terabytes needed for the values alone.


1 Answer


Cassandra has the list type, which can store up to 64K elements. Since 15 years of 1-minute data comes to roughly 1.5 to 2 million values per stock (390 minutes per day over 15 years, depending on how many days per year you store), far more than 64K, you would need additional key columns to break it down into groups of 64K elements or fewer.

Suppose you decided to store it by day (at most 390 data points per trading day, given your description). Then you could define the table like this:

CREATE TABLE stock_values_by_day (
  stock_name text,
  year int,
  day_number_within_year int,
  values list<int>,
  PRIMARY KEY (stock_name, year, day_number_within_year) );

So the stock name would be the partition key, and the year and day number would be clustering columns.

Then you'd store the 390 data points for each day in the list field. That way you spend very little space on time keys, you can query for data on a per-stock, per-day basis, and you can also do range queries for multiple days within a year, as sketched below.
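
For example, here is a minimal sketch of writing one day's values and then range-reading the first few weeks of a year (the 'Apple' key, the day numbers and the literal prices are placeholders, not real data):

INSERT INTO stock_values_by_day (stock_name, year, day_number_within_year, values)
VALUES ('Apple', 2000, 3, [10543, 10548, 10551]);  -- in practice the list would hold all 390 values for that day

SELECT day_number_within_year, values
FROM stock_values_by_day
WHERE stock_name = 'Apple'
  AND year = 2000
  AND day_number_within_year >= 1
  AND day_number_within_year <= 15;

The equality restriction on year is what allows the range on day_number_within_year, since both are clustering columns and must be restricted in order.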

How you break it down would depend on the level of granularity you want when accessing the data (e.g. per day, per month, etc.); a per-month variant is sketched below.
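
For instance, a possible per-month table (the name and layout here are just an illustration, not something Cassandra prescribes) still fits comfortably under the 64K element limit, since a month of trading is at most around 23 days × 390 minutes ≈ 9,000 values:

CREATE TABLE stock_values_by_month (
  stock_name text,
  year int,
  month int,
  values list<int>,
  PRIMARY KEY (stock_name, year, month) );

Fewer, larger rows mean fewer clustering keys stored overall, at the cost of reading a whole month even when you only need one day.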

Another way would be to store the data in a blob field. In your application you'd encode your data, say a year of values, into a binary blob and save it that way. Then, when you read it out, you'd have to expand the binary blob back into the original array of values.
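
A rough sketch of what that schema could look like (the table and column names here are my own, and the packing of values into bytes happens entirely in your application code, not in CQL):

CREATE TABLE stock_values_by_year (
  stock_name text,
  year int,
  packed_values blob,
  PRIMARY KEY (stock_name, year) );

This is the most compact option, since Cassandra stores only one cell per stock per year, but you lose the ability to read or update a single day without fetching and rewriting the whole blob.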