1
votes

I am storing time series data in Cassandra on a daily basis, and we would like to archive or purge data older than 2 days every day. We are using the Hector API to store the data. Can someone suggest an approach for deleting Cassandra data older than 2 days on a daily basis? Using a TTL on the Cassandra rows is not feasible, as the number of days after which data is deleted is configurable. Right now there is no timestamp column in the table; we are planning to add one. The problem is that the timestamp alone cannot be used in a WHERE clause, as this new column is not part of the primary key. Please provide your suggestions.

3
Is your model adapted/designed for something else? Because this doesn't look like time series data in Cassandra: a timestamp-like column should be part of the clustering key. - Cedric H.

3 Answers

2
votes

TTL is the right answer. There is an internal timestamp attached to every mutation that is used for this, so you don't need to add one. Manually purging is almost never a good idea. You may need to work on your data model a bit; check the DataStax Academy examples for time series.

Also, Thrift has been frozen for two years and is now officially deprecated (removal in 4.0). Hector and other Thrift clients are not really maintained anymore (see here). Using CQL and the Java driver will give better results, with more resources available to learn from as well.

0
votes

I don't see what is stopping you from using the TTL approach.

TTL can be set not only when defining the schema, but also when saving data into the table using the DataStax Cassandra driver.

So, in reality, you can have a separate TTL for each row, configured by your Java code.
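To illustrate the per-write TTL idea, here is a minimal sketch that turns a configurable retention period (in days) into the TTL value for a CQL `INSERT ... USING TTL ?` statement. The table `readings`, its columns, and the class and method names are hypothetical, chosen only for this example; the statement string would be prepared and bound with whatever driver you use.

```java
import java.time.Duration;

public class TtlInsertSketch {
    // Hypothetical table and column names, for illustration only.
    // The last bind marker receives the TTL in seconds.
    static final String INSERT_CQL =
        "INSERT INTO readings (sensor_id, reading_time, value) "
        + "VALUES (?, ?, ?) USING TTL ?";

    /** Converts a configurable retention period in days to a TTL in seconds. */
    static int ttlSeconds(int retentionDays) {
        if (retentionDays <= 0) {
            throw new IllegalArgumentException("retentionDays must be positive");
        }
        return Math.toIntExact(Duration.ofDays(retentionDays).getSeconds());
    }

    public static void main(String[] args) {
        // Retention comes from configuration; here it is read from the
        // command line and defaults to the 2 days mentioned in the question.
        int retentionDays = args.length > 0 ? Integer.parseInt(args[0]) : 2;
        System.out.println(INSERT_CQL);
        System.out.println("TTL seconds: " + ttlSeconds(retentionDays));
    }
}
```

One thing to keep in mind: the TTL is fixed at write time, so changing the configured number of days only affects data written after the change, not rows that are already stored.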

Also, as Chris already mentioned, TTL uses the internal timestamp for this.

0
votes

Strictly based on what you describe, I think the only solution is to add that timestamp column and add a secondary index on it.

However, this is a strong indicator that your data model is poorly suited to the situation.

Emphasising my initial comment:

Is your model adapted/designed for something else? Because this doesn't look like time series data in Cassandra: a timestamp-like column should be part of the clustering key.
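As a rough sketch of what "timestamp as part of the clustering key" could look like, the snippet below holds a hypothetical table definition and a purge statement as CQL strings, plus a helper that computes the cutoff instant from a configurable number of days. The table `readings`, its columns, and the class and method names are all assumptions made for this example. The range `DELETE` on a clustering column also assumes a Cassandra version that supports it (3.0 or later, as far as I know).

```java
import java.time.Duration;
import java.time.Instant;

public class TimeSeriesModelSketch {
    // Hypothetical schema: sensor_id is the partition key and
    // reading_time is a clustering column, so rows within a partition
    // are ordered by time and can be filtered by time range.
    static final String CREATE_TABLE_CQL =
        "CREATE TABLE IF NOT EXISTS readings ("
        + " sensor_id text,"
        + " reading_time timestamp,"
        + " value double,"
        + " PRIMARY KEY ((sensor_id), reading_time)"
        + ") WITH CLUSTERING ORDER BY (reading_time DESC)";

    // Because reading_time is part of the primary key, it can appear
    // in the WHERE clause together with the partition key.
    static final String PURGE_CQL =
        "DELETE FROM readings WHERE sensor_id = ? AND reading_time < ?";

    /** Returns the instant before which data counts as expired. */
    static Instant cutoff(Instant now, int retentionDays) {
        return now.minus(Duration.ofDays(retentionDays));
    }

    public static void main(String[] args) {
        int retentionDays = 2; // would normally come from configuration
        System.out.println(CREATE_TABLE_CQL);
        System.out.println(PURGE_CQL);
        System.out.println("Cutoff: " + cutoff(Instant.now(), retentionDays));
    }
}
```

With a model like this, the purge statement still has to be run once per partition key, and every delete writes tombstones, which is part of why TTL is usually the preferred route.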