I am working on a Cassandra data model for storing time series (I'm a Cassandra newbie). I have two applications: intraday stock data and sensor data.
The stock data will be saved with a time resolution of one minute. Seven data fields make up one record: Symbol, Datetime, Open, High, Low, Close, Volume.
I will query the data mostly by Symbol and Date, e.g. give me all data for AAPL between 2013-01-01 and 2013-01-31, ordered by Datetime.
As I understand it, the recommendation for Cassandra queries is to query whole columns. So you could create five rows with the keys Open, High, Low, Close, Volume, and one column for each combination of Symbol and minute, e.g. "AAPL:2013-01-04T130400Z". This would result in a table of five rows and n*nT columns (n: number of symbols, nT: number of minutes).
Most of the time I will query date ranges, i.e. all minutes of a day. So I could rearrange the data to have columns named "AAPL:2013-01-04" and rows named OpenT130400Z, HighT130400Z, LowT130400Z, CloseT130400Z, VolumeT130400Z. This would result in a table with n*nD columns (n: number of symbols, nD: number of days) and 5*nM rows (nM: number of minutes/entries per day).
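To make the second layout concrete, here is a minimal Python sketch of how I would build the column names and row keys and estimate the table size. The function names (`column_name`, `row_keys`, `table_shape`) and the example numbers are just my own illustration, not anything from Cassandra or a driver.

```python
from datetime import datetime, timezone

# The five value fields; Symbol and Datetime go into the column name / row key.
FIELDS = ["Open", "High", "Low", "Close", "Volume"]


def column_name(symbol, ts):
    """Column name for one symbol and one day, e.g. 'AAPL:2013-01-04'."""
    return f"{symbol}:{ts:%Y-%m-%d}"


def row_keys(ts):
    """The five row keys for one minute, e.g. 'OpenT130400Z'."""
    return [f"{field}T{ts:%H%M%S}Z" for field in FIELDS]


def table_shape(n_symbols, n_days, n_minutes_per_day):
    """(rows, columns) of the layout: 5*nM rows and n*nD columns."""
    return (len(FIELDS) * n_minutes_per_day, n_symbols * n_days)


ts = datetime(2013, 1, 4, 13, 4, 0, tzinfo=timezone.utc)
print(column_name("AAPL", ts))  # AAPL:2013-01-04
print(row_keys(ts)[0])          # OpenT130400Z
# Hypothetical sizes: 100 symbols, 250 days, 390 minutes per day.
print(table_shape(100, 250, 390))
```

With these made-up sizes the table would have 1,950 rows and 25,000 columns.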
To sum up: each column holds the data for one whole day for one symbol.
I have found a description of how to deal with time series data in Cassandra here: http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra. But I don't really understand whether they use the hour (1332960000) as a column name or as a row key. As I understood it, they use the hour as the row key and the small time steps as columns, so they would have a fixed number of columns. But that would be a disadvantage when reading, because I would have to do a range query on keys. Am I right?
Second question: if I have sensor data that is much more fine-grained than the one-minute stock data (let's say I have to save time steps with a resolution of microseconds), how would I deal with this? If I use columns for a composite of sensor channel and hour, and rows for the microseconds since the start of that hour, this would result in 3,600,000,000 rows and n*nH columns (n: number of sensors, nH: number of hours). I could not use the microseconds since the start of the hour as columns, because that gives 3.6 billion points, which is higher than the allowed number of 2 billion columns.
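To double-check my arithmetic, here is a small Python sketch. It assumes the worst case of one data point for every microsecond, and it takes the 2 billion figure as the column limit as I understood it. The names are my own.

```python
MICROSECONDS_PER_SECOND = 1_000_000
SECONDS_PER_HOUR = 3_600
# Assumption: the limit of 2 billion columns as I understood it.
MAX_COLUMNS = 2_000_000_000

# Worst case: one data point per microsecond for a full hour.
points_per_hour = MICROSECONDS_PER_SECOND * SECONDS_PER_HOUR


def largest_bucket_seconds(limit=MAX_COLUMNS, points_per_second=MICROSECONDS_PER_SECOND):
    """Longest time bucket (in whole seconds) whose point count stays within the limit."""
    return limit // points_per_second


print(points_per_hour)                # 3600000000
print(points_per_hour > MAX_COLUMNS)  # True
print(largest_bucket_seconds())       # 2000
```

So under these assumptions an hour does not fit into the columns, but a shorter bucket of at most 2,000 seconds (about 33 minutes) would stay within the limit.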
Did I get it right? What do you think about this problem, and how would you solve it?
Thank you!
Best, Malte