I've noticed that changing a regular Cassandra column to a clustering column can significantly reduce the size of the table in some circumstances.
For this example table:
id UUID K
time TIMESTAMP C
state TINYINT (C)
value DOUBLE
The size of 100000 rows is estimated at 3.9 MB if state
is an ordinary column, or 2.4 MB if state
is a clustering column (estimated using the method in DataStax course DS220).
If you look at how the data is physically stored it isn't hard to see why this difference exists. In the former case there are two internal cells per timestamp - one for state
and one for value
. In the latter case value
is incorporated into the cell key so there is just one cell per timestamp, and the timestamp (part of the cell key) is stored only once.
The second clustering column does not create any new restrictions on what can be queried. SELECT * FROM table WHERE id=? AND time>=? AND time<?
is still fine.
It seems like a win-win situation. Are there any downsides, in particular, performance-wise?
(All I can think of is that if state
is a regular column then it can be omitted from an INSERT and the state
internal cell will never be created. I imagine if state
is a regular column and usually omitted then the table will be very slightly smaller than if state
is a clustering column.)
Additional comments
It's worth noting that in the definition above you can't filter by state
without an equality filter on time
, making it not very useful for filtering state
. And if you put the state
column above time
to resolve this then yes you can filter by state
and time
inequality, but if you want all states (IN clause) then the rows are returned ordered by state
first, then time
, which again is not very useful.