We are evaluating Cassandra for use with financial time series data and are trying to understand the most performant way to store and retrieve the data we need. We are running Cassandra on a virtual machine with 8 cores and 8 GB of RAM allocated; the remaining resources of the host machine (another 8 cores and 12 GB of RAM) are used for developing the test client application. Our data is currently stored in flat files and amounts to roughly 100-150 GB per day (uncompressed). When retrieving data from Cassandra we need to be able to stream any of the following:
- All of the data, i.e. data for all securities for an entire day, ordered by timestamp
- All of the data for a time period that is a subset of the day, ordered by timestamp
- Data for a subset of the securities over a time period that is a subset of the day, ordered by timestamp
We have so far experimented with partitioning the data based on security and day with a table that has the following schema:
create table MarketData (
Security text
,Date date
,Timestamp timestamp
...
other columns
...
primary key((Security,Date),timestamp));
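For reference, with this partitioning the only efficient reads are those that name a full partition key, i.e. one security on one day, optionally sliced on the clustering column. The query shapes look like this (the security and times below are placeholders):

```sql
-- Whole partition: one security for one day, ordered by timestamp.
select * from MarketData where Security = 'AAPL_O' and Date = '2016-01-12';

-- Slice of that partition: a time window within the day, still ordered,
-- because timestamp is the clustering column.
select * from MarketData
where Security = 'AAPL_O' and Date = '2016-01-12'
  and timestamp >= '2016-01-12 09:30:00' and timestamp < '2016-01-12 10:00:00';
```

Reading all securities for a day, by contrast, means touching every (Security, Date) partition.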
However, when we perform a simple paged query from a C# client application, as below, it takes roughly 8 seconds to retrieve 50K records, which is very poor. We've experimented with different page sizes, and a page size of approximately 450 seems to give the least bad results.
var ps = client.Session.Prepare("select security, date, timestamp, toUnixTimestamp(timestamp) from marketdata where security = ? and date = ?");
int pageSize = 450;
var statement = ps.Bind("AAPL_O", new LocalDate(2016, 01, 12)).SetPageSize(pageSize);
stopwatch.Start();
var rowSet = client.Session.Execute(statement);
foreach (Row row in rowSet)
{
    // Iterating the RowSet transparently fetches subsequent pages.
}
stopwatch.Stop();
Furthermore, this kind of schema would also be problematic for selecting sorted data across partitions (i.e. for multiple securities), since that requires sorting across partitions, which Cassandra does not seem to be well suited to.
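One workaround we are aware of: Cassandra only guarantees ordering within a partition, but each (Security, Date) partition is already sorted by timestamp, so a client can stream each partition separately and k-way-merge the sorted streams. A minimal illustrative sketch (in Python rather than our C# client; the rows and field layout are made up):

```python
import heapq

# Hypothetical per-partition result streams: each is already sorted by
# timestamp, exactly as Cassandra returns rows within one partition.
aapl = [(1, "AAPL_O", 100.0), (3, "AAPL_O", 100.2)]
msft = [(2, "MSFT_O", 50.0), (4, "MSFT_O", 50.1)]

# heapq.merge lazily merges k sorted iterables into one sorted stream,
# holding only one "head" row per partition in memory at a time.
merged = list(heapq.merge(aapl, msft, key=lambda row: row[0]))

for ts, security, price in merged:
    print(ts, security, price)
```

With paged result sets per partition this keeps memory bounded, at the cost of holding one open query per security being merged.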
We have also considered partitioning based on minute with the following schema:
create table MarketData (
Year int
,Month int
,Day int
,Hour int
,Minute int
,Security text
,Timestamp timestamp
...
other columns
...
primary key((Year,Month,Day,Hour,Minute),Timestamp,Security));
(Note that Security needs to be part of the clustering key here; with primary key((Year,Month,Day,Hour,Minute),Timestamp) alone, two securities ticking at the same timestamp in the same minute bucket would overwrite each other.)
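One consequence of minute bucketing is that any time-range read has to enumerate every minute bucket in the range and issue one query per partition, then concatenate (or merge) the results; a full day is 1,440 partitions. A small sketch of the key enumeration (illustrative Python; the function name is ours):

```python
from datetime import datetime, timedelta

def minute_buckets(start, end):
    """Yield the (year, month, day, hour, minute) partition keys that
    cover the half-open interval [start, end)."""
    t = start.replace(second=0, microsecond=0)
    while t < end:
        yield (t.year, t.month, t.day, t.hour, t.minute)
        t += timedelta(minutes=1)

# A full trading day spans 1,440 minute-bucket partitions.
day = list(minute_buckets(datetime(2016, 1, 12), datetime(2016, 1, 13)))
print(len(day))  # 1440
```

Within each bucket the rows come back ordered by timestamp, so concatenating buckets in chronological order preserves the overall ordering.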
However, our concern is that in our preliminary test the performance of paging through the results of a straightforward select statement was so poor.
Are we approaching things in the wrong way? Could our configuration be incorrect? Or is Cassandra perhaps not the appropriate big data solution for what we are trying to achieve?
Thanks