
Using Cassandra version 3.11.4, we imported several days of time-series-like data into a table created with TimeWindowCompactionStrategy, with compaction_window_unit in hours and a compaction_window_size of 1:

CREATE TABLE MYTABLE (
  some_fields text,
  (...)
) WITH (...)
  AND compaction = {
    'class' : 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'HOURS',
    'compaction_window_size': 1
};

Since this is historical data imported from another database, we set the write timestamp on each insert query like this:

INSERT INTO MYTABLE (...) USING TIMESTAMP [timestamp of the record] AND TTL ...

where [timestamp of the record] is the timestamp of each time-series record being inserted.
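One detail worth double-checking here: CQL's USING TIMESTAMP expects microseconds since the epoch, while the TWCS bucket keys in the compaction logs are shown in milliseconds. A minimal sketch of the conversion we'd apply to each record's timestamp before building the insert (the function name is our own, not from the question):

```python
from datetime import datetime, timezone

def writetime_micros(record_time: datetime) -> int:
    """Convert a record's datetime to the microsecond epoch value
    that CQL's USING TIMESTAMP expects."""
    return int(record_time.timestamp() * 1_000_000)

# Example: a record from 2018-04-07 18:00:00 UTC
ts = writetime_micros(datetime(2018, 4, 7, 18, 0, 0, tzinfo=timezone.utc))
print(ts)  # 1523124000000000
```

If millisecond values were passed by mistake, all writes would appear to come from January 1970 and would collapse into a single TWCS window.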

This method apparently worked, as we verified by enabling TRACE-level logging on the org.apache.cassandra.db.compaction package:

TRACE [CompactionExecutor:421] ...TimeWindowCompactionStrategy.java:252 - buckets {
1523124000000=[BigTableReader(path='.../md-487-big-Data.db')], 
1523070000000=[BigTableReader(path='.../md-477-big-Data.db')], 
1523109600000=[BigTableReader(path='.../md-530-big-Data.db')], 
1523134800000=[BigTableReader(path='.../md-542-big-Data.db')] }, 
max timestamp 1523134800000
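The bucket keys in that trace are epoch milliseconds aligned to the start of an hour. A small sketch of the hour-bucketing rule (our own reconstruction of the rounding, not the actual Cassandra source):

```python
WINDOW_MS = 60 * 60 * 1000  # 1-hour compaction window, in milliseconds

def hour_bucket(ts_ms: int) -> int:
    # Round a write timestamp (epoch millis) down to its window start.
    return (ts_ms // WINDOW_MS) * WINDOW_MS

# Every bucket key from the trace above is already hour-aligned:
for key in (1523124000000, 1523070000000, 1523109600000, 1523134800000):
    assert hour_bucket(key) == key

# A write 25 minutes into the hour falls into the same bucket:
print(hour_bucket(1523124000000 + 25 * 60 * 1000))  # 1523124000000
```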

There we found several one-hour-wide buckets.

The problem came when we ran nodetool compact on every Cassandra node.

We expected to end up with a single SSTable for each one-hour bucket. What we got was a single huge SSTable (per node), with all rows merged!

Is this the expected behavior? Are we doing something wrong?

I also wanted to add that I tried with -s and it still created one big SSTable file in my case. That outcome very much contradicts what the docs say for that option: "Use -s to not create a single big file" - itzg

1 Answer


This is expected behavior: a major compaction (nodetool compact) merges all SSTables regardless of the compaction strategy's windows. You can either take the node offline and split the SSTables, or wait for all TTLs to expire and then watch the single large SSTable get cleaned up. Remember to turn off repair on tables using TWCS, otherwise things can get messy; I learned that the hard way. Other than that, it's a great compaction strategy for time-series data.
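For the offline-split route, the Cassandra 3.x distribution ships an sstablesplit tool (under tools/bin) that breaks an SSTable into fixed-size chunks while the node is stopped. A hedged sketch; the data path and file name below are placeholders, not taken from the question:

```shell
# Stop the node first -- sstablesplit must not run against a live node.
nodetool drain && sudo service cassandra stop

# Split the large SSTable into ~50 MB chunks (-s is the target size in MB).
tools/bin/sstablesplit -s 50 /var/lib/cassandra/data/mykeyspace/mytable-*/md-*-big-Data.db

sudo service cassandra start
```

Note that the split is by size, not by time: the resulting SSTables each still span many time windows, so this caps file size but does not restore the per-hour bucketing.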