Any downside to 'redundant' clustering column?

Question

I've noticed that changing a regular Cassandra column to a clustering column can significantly reduce the size of the table in some circumstances.

For this example table:

id     UUID        K
time   TIMESTAMP   C
state  TINYINT    (C)
value  DOUBLE

The size of 100000 rows is estimated at 3.9 MB if state is an ordinary column, or 2.4 MB if state is a clustering column (estimated using the method in DataStax course DS220).

If you look at how the data is physically stored it isn't hard to see why this difference exists. In the former case there are two internal cells per timestamp - one for state and one for value. In the latter case value is incorporated into the cell key so there is just one cell per timestamp, and the timestamp (part of the cell key) is stored only once.

The second clustering column does not create any new restrictions on what can be queried. SELECT * FROM table WHERE id=? AND time>=? AND time<? is still fine.

It seems like a win-win situation. Are there any downsides, in particular, performance-wise?

(All I can think of is that if state is a regular column then it can be omitted from an INSERT and the state internal cell will never be created. I imagine if state is a regular column and usually omitted then the table will be very slightly smaller than if state is a clustering column.)

Additional comments It's worth noting that in the definition above you can't filter by state without an equality filter on time, making it not very useful for filtering state. And if you put the state column above time to resolve this then yes you can filter by state and time inequality, but if you want all states (IN clause) then the rows are returned ordered by state first, then time, which again is not very useful.

Jim Wartnick Jim Wartnick · Accepted Answer · 2019-02-15T16:13:51

I would think the main difference here is that if it's a clustering column it must be provided with INSERTs as it's part of the primary key. Also, as it's part of the primary key, you can't update it either, which could be problematic for some tables. If you don't have any concerns about either of those two, I don't see any reason why you couldn't add it.

Any downside to 'redundant' clustering column?

2 Answers