1
votes

I have a straightforward schema in Cassandra where I am storing a large amount of financial markets time series as per Pattern 1 in the recommendation here ie: enter image description here There are about 2000 tickers, each in their own row, and there are between 100k and 1m time series data points ("columns") for each of these. How do I efficiently get the ticker names (ie the 2000 ticker names? In the above analogy, the blue "WeatherStationID" cells?). The only difference from the above is that I have the rows ordered by time DESC so my actual table looks like this:

CREATE TABLE ts.mindesc (
    ticker text,
    time timestamp,
    close float,
    high float,
    low float,
    numevents int,
    open float,
    source text,
    value float,
    PRIMARY KEY (ticker, time)
) WITH CLUSTERING ORDER BY (time DESC)

Do I have to select * from the whole table above a certain timestamp value and parse this? The problem with that is that not all tickers have recent data so I might have to go quite far back in terms of time to capture them all, in which case for tickers with lots of recent data I could be getting through large amounts of redundant information.

1

1 Answers

2
votes

Actually, you should be able to query the partition keys very quickly by using DISTINCT:

SELECT DISTINCT ticker FROM ts.mindesc;

While 100k rows in CQL probably wouldn't perform well with an unbound WHERE, 2000 partition keys shouldn't be a problem.