Cassandra Data Modelling: Use a Map or have a lot of empty columns?

Question

I have roughly 20-30 columns in total that I would need to store in my column family. However, my data comes in different variations: I have different objects that belong together logically but do not have the same fields (fields as in key names). Sometimes 5 fields are provided, sometimes 7, and so on. All of them share a subset of fields that is always provided, though.

A row I insert into this column family will never have all of the columns filled. When using a Map, I could add key/values based on the object type and would avoid the possible overhead introduced by the model with one column per field.

I am concerned about having a lot of empty columns in each row.

A possible downside of using a Map is that an index on map keys and an index on map values cannot coexist.

Questions gathered:

1. Would you suggest using a Map, or just adding all of the columns I may need to my column family?
2. I assume that querying the data based on keys/values in the Map is way slower than "directly" accessing them from the columns. Is this correct?
3. What downsides are there when I have a lot of empty columns for each row? Overhead?
4. Is it possible to have a "generic" value type when using a Map? I want to store different data, mostly Strings but also Floats and Integers. Do I need to use a map<text,text> and cast the values within my application?

I am using Cassandra 3.0.8 | CQL spec 3.4.0 | Native protocol v4

Thanks

riccamini · Accepted Answer · 2016-07-19T14:54:52

I think that having sparse column values is totally fine, since that is one of the reasons BigTable and the related solutions implementing the same sparse map data model were created.

I would be more concerned about the limitations of CQL collections instead, as pointed out in another S.O. answer here.
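One collection limitation that bears on the "generic value type" question: a CQL map has a single declared value type, so mixing Strings, Floats and Integers in one map generally means storing everything as map<text,text> and casting in the application. A minimal sketch of that casting in Python follows. The key names and the FIELD_TYPES registry are hypothetical, made up for illustration, and the sketch only handles the dict side (no driver calls).

```python
# Hypothetical registry: which Python type each map key should decode to.
# The key names are invented for this example.
FIELD_TYPES = {"name": str, "price": float, "quantity": int}


def encode_attrs(attrs):
    """Turn typed values into the str -> str dict a map<text,text> column expects."""
    return {key: str(value) for key, value in attrs.items()}


def decode_attrs(raw):
    """Cast text values back to their registered type; unknown keys stay str."""
    return {key: FIELD_TYPES.get(key, str)(value) for key, value in raw.items()}


stored = encode_attrs({"name": "widget", "price": 9.5, "quantity": 3})
restored = decode_attrs(stored)
```

The cost of this approach is that the type information lives only in the application, so every reader of the table has to share the same registry.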

Regarding your specific questions:

1. I would personally use plain columns.
2. It depends on the access pattern. Do you need all the columns in the map? If not, be aware that Cassandra will retrieve the collection as a whole, so you will get all the data even if it is not needed.
3. I don't see any overhead here: data will be stored contiguously, ignoring empty columns.
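One caveat with plain columns: "empty" should mean "not written at all". Explicitly inserting a null into a column writes a tombstone, so it is worth building each INSERT from only the fields that the object actually provides. A rough sketch of that in Python follows. The table and column names are hypothetical, and the %s placeholder style is an assumption about how the statement will be executed, so adjust it to your driver.

```python
def build_insert(table, row):
    """Return (cql, params) covering only the fields that are actually set.

    Columns left out of the INSERT are simply not written; binding an
    explicit null instead would create a tombstone for that column.
    The table and column names must come from trusted code, not user input,
    because they are interpolated into the statement text.
    """
    present = {col: val for col, val in row.items() if val is not None}
    columns = ", ".join(present)
    placeholders = ", ".join(["%s"] * len(present))
    cql = f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"
    return cql, list(present.values())


# Hypothetical object with one field missing: "price" is left out entirely.
cql, params = build_insert(
    "events",
    {"id": 42, "kind": "click", "price": None, "label": "buy"},
)
```

With this, each object type only touches the columns it has values for, and the remaining columns of the table cost nothing for that row.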

Anyway, you can find some info about Cassandra's limitations here. It's an old page, but I assume you can treat its figures as lower bounds for the current values.

Hope it helps.
