
We're in the process of evaluating Cassandra for use with financial time series data and are trying to understand the best way to store and retrieve the data we need in the most performant way. We are running Cassandra on a virtual machine to which 8 cores and 8 GB of RAM have been allocated. The remaining resources of the host machine (another 8 cores and 12 GB of RAM) are used for development of the testing client application. Our data is currently stored in flat files and is of the order of 100-150 GB each day (uncompressed). In terms of retrieving the data from Cassandra, we need to be able to stream any of the following:

  1. All of the data - i.e. stream data for all securities for an entire day ordered by timestamp
  2. All of the data for a particular time period which is a subset of the entire day ordered by timestamp
  3. Data for a subset of the securities and a particular time period which is a subset of the entire day ordered by timestamp.

We have so far experimented with partitioning the data based on security and day with a table that has the following schema:

create table MarketData (
    Security text,
    Date date,
    Timestamp timestamp,
    ...
    other columns
    ...
    primary key ((Security, Date), Timestamp)
);
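With this schema, only the third access pattern maps to a single partition; the first two require touching every partition. A sketch of the resulting queries (the security/date values are illustrative, and columns beyond the key are assumed):

```sql
-- Pattern 3: one security, one day, restricted to a time window.
-- The clustering column Timestamp supports both the range filter
-- and the ordering within the partition.
SELECT * FROM MarketData
WHERE Security = 'AAPL_O'
  AND Date = '2016-01-12'
  AND Timestamp >= '2016-01-12 09:30:00'
  AND Timestamp <  '2016-01-12 10:30:00';

-- Patterns 1 and 2 (all securities for a day or time window) cannot be
-- answered from one partition here: they require fanning out one query
-- per (Security, Date) partition and merge-sorting by timestamp on the
-- client side.
```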

However when we perform a simple paged query from within a C# client application as below it takes roughly 8 secs to retrieve 50K records, which is very poor. We've experimented with different page sizes and a page size of approx. 450 seems to give the least bad results.

var ps = client.Session.Prepare(
    "select security, date, timestamp, toUnixTimestamp(timestamp) from marketdata where security = ? and date = ?");
int pageSize = 450;
var statement = ps.Bind("AAPL_O", new LocalDate(2016, 01, 12)).SetPageSize(pageSize);
var stopwatch = new System.Diagnostics.Stopwatch();
stopwatch.Start();
var rowSet = client.Session.Execute(statement);
// Iterating the RowSet transparently fetches subsequent pages.
foreach (Row row in rowSet)
{
}
stopwatch.Stop();

Furthermore, this kind of schema would also be problematic for selecting SORTED data across partitions (i.e. for multiple securities), since it involves sorting across partitions, which Cassandra doesn't seem to be well suited to.

We have also considered partitioning based on minute with the following schema:

create table MarketData (
    Year int,
    Month int,
    Day int,
    Hour int,
    Minute int,
    Security text,
    Timestamp timestamp,
    ...
    other columns
    ...
    primary key ((Year, Month, Day, Hour, Minute), Timestamp)
);
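With minute-sized partitions, reading any longer window means issuing one query per minute bucket and concatenating the results client-side. A sketch, assuming the minute-partitioned schema above:

```sql
-- One query per minute bucket; the client stitches the results
-- together in timestamp order.
SELECT * FROM MarketData
WHERE Year = 2016 AND Month = 1 AND Day = 12 AND Hour = 9 AND Minute = 30;

SELECT * FROM MarketData
WHERE Year = 2016 AND Month = 1 AND Day = 12 AND Hour = 9 AND Minute = 31;

-- ... and so on for every minute in the requested range.
```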

However, our concern is that our preliminary test of paging through the results of a straightforward 'select' statement performed so poorly.

Are we approaching things in the wrong way? Could our configuration be incorrect? Or is Cassandra maybe not the appropriate big data solution for what we are trying to achieve?

Thanks

Another point to consider is: what are these "records"? If they are 50 MB each or 50 bytes each, the results will be totally different. – Cedric H.

1 Answer


".... poor performance...."

"We are running Cassandra on a Virtual machine "

I think those two highlighted phrases are related :). Out of curiosity, what is the nature of your hard drive? Shared storage? SAN? Spinning disk? SSD?

Furthermore, this kind of a schema would also be problematic in terms of selecting SORTED data across partitions (i.e. for multiple securities)

Exactly: Cassandra does not sort by partition key. You'll probably need to create another table (or a materialized view, a new Cassandra 3.0 feature) with PRIMARY KEY ((time_period), security, timestamp) so that you can order by security.
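As a sketch of that idea (the table name is hypothetical, and `time_period` here is assumed to be the day):

```sql
-- Hypothetical lookup table: one partition per time period, clustered
-- by security then timestamp so rows come back already sorted.
CREATE TABLE MarketDataBySecurity (
    time_period date,
    security    text,
    timestamp   timestamp,
    -- ... other columns ...
    PRIMARY KEY ((time_period), security, timestamp)
);

-- Rows within the partition are returned ordered by (security, timestamp).
SELECT * FROM MarketDataBySecurity WHERE time_period = '2016-01-12';
```

Note that this trades partition size for ordering: all securities for a period land in a single partition, so the period has to be chosen small enough to keep partitions manageable.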

Are we approaching things in the wrong way?

Yes. Why do you want to run a performance benchmark on a virtual machine? Those two ideas are contradictory. The general recommendation with Cassandra is to use dedicated hard drives (spinning disk at least, preferably SSD). Cassandra read performance is strongly bound to your disk I/O.

With virtual machines and virtualized storage, you defeat all of Cassandra's optimizations for disk throughput. Writing a sequential block of data to a virtualized disk does not guarantee that the data is actually written sequentially, because the hypervisor/virtual disk controller can re-order it or split it across several blocks on the actual physical disks.

Cassandra deployments on virtual machines are only suited for proofs of concept, to validate a data model and queries. You'll need dedicated physical hard drives to benchmark the actual performance of your data model with Cassandra.