1
votes

For monitoring some distributed software I insert their monitoring data into Cassandra table. The columns are metric_type, metric_value, host_name, component_type and time_stamp. The scenario is I collect all the metrics for all the nodes in every second. The time in uniform for all nodes and their metrics. The keys(that differentiate rows) are host_name, component_type, metric_type and time_stamp. I design my table like below:

CREATE TABLE metrics (
    component_type text,
    host_name text,
    metric_type text,
    time_stamp bigint,
    metric_value text,
    PRIMARY KEY ((component_type, host_name, metric_type), general_timestamp)
) WITH CLUSTERING ORDER BY (time_stamp DESC)

where component_type, host_name and metric_type are partitions key and time_stamp is clustering key. The metrics table is suitable for the queries that gets some data according to their timestamp just for a host_name or a metric_type or a component_type, as using partition keys Cassandra will find the partition that data are stored and using clustering key will fetch data from that partition and this is the optimal case for Cassandra queries.

Besides that, I need a query that fetches all data just using time_stamp. For example :

SELECT * from metrics WHERE time_stamp >= 1529632009872 and time_stamp < 1539632009872 ;

I know the metric table is not optimal for the above query, because it should search every partition to fetch data. I guess in this situation we should design another table with the time_stamp as partition key, so data will be fetched from one or some limited number of partitions. But I am not certain about some aspects:

  • Is it optimal to set time_stamp as partition key? because of I insert data into the database every second and the partition key numbers will be a lot!
  • I need my queries to be interval on time_stamp and I know interval conditions are not allowed in partition keys, just allowed on clustering keys!

So what is the best Cassandra data model for such time series data and query?

1
You might want to look into KairosDB. Even if you don’t necessarily use their software, you may want to follow their schema.myron-semack

1 Answers

0
votes

Using time_stamp as partition key is not optimal in my opinion, as it would create a lot of partitions.

I would propose 2 solutions:

1) Go with a "week_first_day" as partition key. You would have to compute the correct week_first_day keys on your application side and then emit multiple select queries.

2) You could use ElasticSearch on top of cassandra. Cassandra remains the primary data source, but you have the freedom, to do complex selects. If you are interested, I would recommend to take a look at Elassandra .