For monitoring some distributed software I insert its monitoring data into a Cassandra table. The columns are metric_type, metric_value, host_name, component_type and time_stamp. The scenario: I collect all the metrics for all the nodes every second, and the timestamp is uniform across all nodes and their metrics. The keys that differentiate rows are host_name, component_type, metric_type and time_stamp. I designed my table like below:
CREATE TABLE metrics (
    component_type text,
    host_name text,
    metric_type text,
    time_stamp bigint,
    metric_value text,
    PRIMARY KEY ((component_type, host_name, metric_type), time_stamp)
) WITH CLUSTERING ORDER BY (time_stamp DESC);
where component_type, host_name and metric_type form the partition key and time_stamp is the clustering key.
The metrics table suits queries that fetch data by timestamp for a given component_type, host_name and metric_type: using the partition key, Cassandra locates the partition where the data is stored, and using the clustering key it reads a contiguous slice from that partition. This is the optimal access pattern for Cassandra queries.
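For instance (the concrete values below are only illustrative), a query of this shape binds every partition key component, so it touches exactly one partition and reads a range of its clustering column:

```cql
SELECT time_stamp, metric_value
FROM metrics
WHERE component_type = 'broker'     -- partition key part (example value)
  AND host_name = 'host-01'         -- partition key part (example value)
  AND metric_type = 'cpu_usage'     -- partition key part (example value)
  AND time_stamp >= 1529632009872   -- range on the clustering key is allowed
  AND time_stamp <  1529632069872;
```

Because the full partition key is specified, Cassandra can route the request directly to the replicas owning that partition and serve it as a single slice read.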
Besides that, I need a query that fetches all data using only time_stamp. For example:
SELECT * FROM metrics WHERE time_stamp >= 1529632009872 AND time_stamp < 1539632009872;
I know the metrics table is not optimal for the above query, because it would have to scan every partition to fetch the data. I guess in this situation we should design another table with time_stamp in the partition key, so data will be fetched from one or a limited number of partitions. But I am not certain about some aspects:
- Is it optimal to set time_stamp as the partition key? Since I insert data every second, the number of partitions would be huge!
- I need interval (range) queries on time_stamp, and I know range conditions are not allowed on partition keys, only on clustering keys!
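One common way to reconcile these two concerns is time bucketing: make the partition key a coarse time unit derived from the timestamp (one partition per day in the sketch below; the day granularity, the table name and the day_bucket column are all assumptions to be tuned so each partition stays reasonably small), and keep time_stamp as the first clustering key so range conditions remain legal:

```cql
CREATE TABLE metrics_by_day (
    day_bucket text,        -- e.g. '2018-06-22', computed from time_stamp by the client
    time_stamp bigint,
    component_type text,
    host_name text,
    metric_type text,
    metric_value text,
    -- the trailing clustering columns keep rows unique when many
    -- metrics share the same timestamp
    PRIMARY KEY ((day_bucket), time_stamp, component_type, host_name, metric_type)
) WITH CLUSTERING ORDER BY (time_stamp DESC);

-- An interval query binds the partition key and ranges over the
-- clustering key; the client issues one such query per day covered
-- by the interval and merges the results:
SELECT * FROM metrics_by_day
WHERE day_bucket = '2018-06-22'
  AND time_stamp >= 1529632009872
  AND time_stamp <  1529718409872;
```

This is only a sketch of the idea raised above, not a definitive model: the bucket must be wide enough that the number of partitions stays manageable, yet narrow enough that a single partition does not grow unbounded.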
So what is the best Cassandra data model for such time series data and query?