4
votes

I'm using Cassandra to store historical data. It's a collection of various objects whose values change over time.

Column Family: Object type
Row: Object Id
Column Name: Timestamp
Column Value: Value at given time

At some point the data becomes 'old', and instead of deleting it I want to store it somewhere else (such as another column family) or 'tag' it in some way so that it is not retrieved along with the rest of the data.

Which is the fastest way to do this? At the moment I'm using Hector to do this:
1. Read the data (using SliceQuery)
2. Write the data to another column family (using ColumnFamilyUpdater)
3. Delete the old data (also using ColumnFamilyUpdater)

I'm not sure whether this is best practice, but I'm quite new to Cassandra...
Thanks.

1 Answer

2
votes

Keep in mind that your data will not only take up space on disk; it will also consume JVM heap, because row bloom filters are always loaded at start-up.

Your solution is fine: you need to read this data and move it somewhere else. There are two options for finding the old data:

  1. Build a reverse index, so that you can access old data quickly.
  2. Go over all the data to find old records. If your data set is spread over many Cassandra nodes, consider Hadoop MapReduce.

The first solution provides fast access to old data, but each insert operation also has to update the index; in Cassandra that extra write is still very fast.
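To make the reverse-index idea concrete, here is a minimal in-memory sketch. It does not use Hector or Cassandra at all: nested sorted maps stand in for the data column family and for an index column family, and the class name, method names, and the one-day bucket width are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** In-memory model (not the Hector API) of a data CF plus a time-bucketed reverse index. */
final class ReverseIndexSketch {
    // Hypothetical bucket width: one day in milliseconds.
    static final long BUCKET_MS = 24L * 60 * 60 * 1000;

    // Data: object id -> (timestamp -> value), like row key -> sorted columns.
    final Map<String, NavigableMap<Long, String>> data = new TreeMap<>();
    // Index: bucket start time -> ids of objects written during that bucket.
    final NavigableMap<Long, Set<String>> index = new TreeMap<>();

    /** Start of the bucket that contains the given timestamp. */
    static long bucketOf(long timestamp) {
        return Math.floorDiv(timestamp, BUCKET_MS) * BUCKET_MS;
    }

    /** Each insert performs one extra write to keep the index up to date. */
    void insert(String objectId, long timestamp, String value) {
        data.computeIfAbsent(objectId, k -> new TreeMap<>()).put(timestamp, value);
        index.computeIfAbsent(bucketOf(timestamp), k -> new TreeSet<>()).add(objectId);
    }

    /** Ids of objects with data in buckets that end at or before the cutoff. */
    Set<String> objectsWithDataOlderThan(long cutoff) {
        Set<String> result = new TreeSet<>();
        // Buckets starting before the cutoff's own bucket lie entirely before the cutoff.
        for (Set<String> ids : index.headMap(bucketOf(cutoff), false).values()) {
            result.addAll(ids);
        }
        return result;
    }
}
```

The archiving job then only has to visit the rows returned by `objectsWithDataOlderThan`, instead of every row. In a real cluster the index would be its own column family, with the bucket as row key and the object ids as column names.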

The second solution requires no extra inserts during daily usage, but it requires a full table scan when you move the old data. This is ideal if you can run such jobs at night.
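The full-scan variant can be sketched the same way. This is an in-memory model under stated assumptions, not Hector code: each column family is a map from object id to a sorted map of timestamp to value, and the class and method names are hypothetical. The loop body mirrors the read, write, delete steps from the question.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** In-memory model (not the Hector API) of a nightly full-scan archive job. */
final class FullScanArchiveSketch {
    /**
     * Moves every column whose timestamp is strictly before the cutoff
     * from the live column family to the archive column family.
     * Returns the number of columns moved.
     */
    static int archiveOlderThan(Map<String, NavigableMap<Long, String>> live,
                                Map<String, NavigableMap<Long, String>> archive,
                                long cutoff) {
        int moved = 0;
        // Full scan: every row is visited, whether or not it holds old data.
        for (Map.Entry<String, NavigableMap<Long, String>> row : live.entrySet()) {
            // 1. Read the slice of columns before the cutoff (a view on the live row).
            NavigableMap<Long, String> old = row.getValue().headMap(cutoff, false);
            if (old.isEmpty()) {
                continue;
            }
            // 2. Write those columns to the archive row with the same key.
            archive.computeIfAbsent(row.getKey(), k -> new TreeMap<>()).putAll(old);
            moved += old.size();
            // 3. Delete them from the live row; clearing the view removes them there.
            old.clear();
        }
        return moved;
    }
}
```

Writing to the archive before deleting from the live data means an interrupted job leaves duplicates rather than losing columns, and rerunning it is harmless because the writes are idempotent.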