For monitoring some distributed software, I insert its monitoring data into a Cassandra table. The columns are metric_type, metric_value, host_name, component_type and time_stamp. Every second I collect all the metrics for all the nodes; the time is uniform across all nodes and their metrics. The keys that differentiate rows are host_name, component_type, metric_type and time_stamp. I designed my table like below:
CREATE TABLE metrics (
    component_type text,
    host_name text,
    metric_type text,
    time_stamp bigint,
    metric_value text,
    PRIMARY KEY ((component_type, host_name, metric_type), time_stamp)
) WITH CLUSTERING ORDER BY (time_stamp DESC);
where component_type, host_name and metric_type form the partition key and time_stamp is the clustering key.
The metrics table is suitable for queries that fetch data by timestamp for a given combination of component_type, host_name and metric_type: using the partition key, Cassandra locates the partition where the data is stored, and using the clustering key it fetches the rows from that partition. This is the optimal case for Cassandra queries.
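For illustration, a typical query of this kind might look like the following (the literal values are just placeholders):

```sql
SELECT metric_value, time_stamp
FROM metrics
WHERE component_type = 'datanode'   -- placeholder values
  AND host_name = 'node-01'
  AND metric_type = 'cpu_usage'
  AND time_stamp >= 1529632009872   -- range on the clustering key is allowed
  AND time_stamp <  1529632069872;
```

All three partition-key columns must be supplied with equality; the range condition is only possible because time_stamp is the clustering key.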
Besides that, I need a query that fetches all data using only time_stamp. For example:
SELECT * from metrics WHERE time_stamp >= 1529632009872 and time_stamp < 1539632009872 ;
I know the metrics table is not optimal for the above query, because Cassandra would have to scan every partition to fetch the data. I guess in this situation I should design another table with time_stamp as the partition key, so that data is fetched from one or a limited number of partitions. But I am not certain about some aspects:
- Is it optimal to set time_stamp as the partition key? Since I insert data every second, the number of partitions will be huge!
- I need range queries on time_stamp, and I know range conditions are not allowed on partition keys, only on clustering keys!
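To make my guess concrete, here is a sketch of what I have in mind: bucketing time into coarser partitions and keeping time_stamp as a clustering key so range queries remain possible. The table name, the day_bucket column and the one-day granularity are all my own assumptions, not a settled design:

```sql
CREATE TABLE metrics_by_time (
    day_bucket     text,     -- e.g. '2018-06-22'; one partition per day (placeholder granularity)
    time_stamp     bigint,
    component_type text,
    host_name      text,
    metric_type    text,
    metric_value   text,
    PRIMARY KEY ((day_bucket), time_stamp, component_type, host_name, metric_type)
) WITH CLUSTERING ORDER BY (time_stamp DESC);
```

With such a table, a time-range query would name the bucket(s) covering the interval and apply the range to time_stamp, but I am not sure this is the right trade-off between partition count and partition size.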
So what is the best Cassandra data model for such time-series data and queries?