6
votes

I'm new to Cassandra and am looking for a best practice for modeling data with the following general structure:

The data is "user"-based (per customer); each user supplies a big data file of around 500K-2M entries, updated periodically a few times a day (sometimes a full update, sometimes only deltas).

Each data file has ~20 mandatory data fields, but each user can add additional columns at their discretion (up to ~100).

The additional data fields are NOT necessarily the same across users (neither the names nor the types of those fields).

Example (CSV format):

user_id_1.csv

| column1 (unique key per user_id)  |  column2  |  column3 |   ...   |  column10  |  additionalColumn1  |  ...additionalColumn_n |
|-----------------------------------|-----------|----------|---------|------------|---------------------|------------------------|
| user_id_1_key_1                   |  value    |  value   |  value  |  value     |                ...  |  value                 |
| user_id_1_key_2                   |  ....     |  ....    |  ....   |  ....      |                ...  |  ...                   |
| ....                              |  ...      |  ...     |  ...    |  ...       |                ...  |  ...                   |
| user_id_1_key_2Million            |  ....     |  ....    |  ....   |  ....      |                ...  |  ...                   |


user_id_XXX.csv (notice that the first 10 columns are identical to the other users' but the additional columns are different - both their names and their types)

|             column1 (unique key per user_id)              |  column2  |  column3 |   ...   |  column10  |  additionalColumn1 (different types than user_id_1 and others)  |  ...additional_column_x |
|-----------------------------------------------------------|-----------|----------|---------|------------|-----------------------------------------------------------------|-------------------------|
| user_id_XXX_key_1                                         |  value    |  value   |  value  |  value     |                                                            ...  |  value                  |
| user_id_XXX_key_2                                         |  ....     |  ....    |  ....   |  ....      |                                                            ...  |  ...                    |
| ....                                                      |  ...      |  ...     |  ...    |  ...       |                                                            ...  |  ...                    |
| user_id_XXX_key_500_thousand (less rows than other user)  |  ....     |  ....    |  ....   |  ....      |                                                            ...  |  ...                    |

Several options I have considered:

Option 1:

  1. Create a "global" keyspace
  2. Create a big table "data" containing everything
  3. Add a user_id column to the big table alongside all of the other columns (including the non-mandatory ones). The primary key becomes (user_id, column_1), where column_1 is unique per user_id

                                     Keyspace
    +--------------------------------------------------------------------------+
    |                                                                          |
    |                                                                          |
    |                                      Data_Table                          |
    |                +  +--------+-------+--------------------------+-----+    |
    |                |  |        |       |                          |     |    |
    |                |  +-------------------------------------------------+    |
    |                |  |        |       |                          |     |    |
    |    many rows   |  +-------------------------------------------------+    |
    |                |  |        |       |                          |     |    |
    |                |  |        |       |                          |     |    |
    |                |  |        |       |                          |     |    |
    |                |  |        |       |     Many columns         |     |    |
    |                |  |        |       +------------------------> |     |    |
    |                |  |        |       |                          |     |    |
    |                |  +-------------------------------------------------+    |
    |                v  +-------------------------------------------------+    |
    |                                                                          |
    +--------------------------------------------------------------------------+
    

A few things that I notice right away:

  1. The user_id repeats itself as many times as there are entries per user
  2. The rows are very sparse in the additional columns (lots of nulls), since users don't necessarily share them
  3. The number of users is relatively small, so the number of additional columns is not huge (10K columns max)
  4. I could compact each user's additional-column data into a single "metadata" column shared by all users
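
A CQL sketch of option 1 might look like this (the column names and text types here are illustrative assumptions; the real table would carry the ~20 mandatory columns plus the sparse union of all users' additional columns):

```sql
-- Sketch of option 1: one big table keyed by (user_id, column_1).
-- Types and column names are assumptions for illustration only.
CREATE TABLE data (
    user_id bigint,
    column_1 text,
    column_2 text,
    column_3 text,
    -- ... remaining mandatory columns ...
    additional_column_1 text,
    -- ... sparse additional columns (union across users) ...
    PRIMARY KEY ( (user_id), column_1 )
);
```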

Option 2:

Create Keyspace per User_id

Create table "data" per keyspace

+-----------------------------------------------------------------------------------+
| column_1 | column_2 | ... | column_n | additional_column_1 | additional_column_n  |
+-----------------------------------------------------------------------------------+

keyspace_user1         keyspace_user2                     keyspace_user_n
+----------------+    +---------------+                  +---------------+
|                |    |               |                  |               |
|                |    |               |                  |               |
|   +-+-+--+-+   |    |    +-+--+--+  |                  |   +--+--+---+ |
|   | | |  | |   |    |    | |  |  |  |   many keyspaces |   |  |  |   | |
|   | | |  | |   |    |    | |  |  |  | +------------->  |   |  |  |   | |
|   | | |  | |   |    |    | |  |  |  |                  |   |  |  |   | |
|   | | |  | |   |    |    | |  |  |  |                  |   |  |  |   | |
|   +--------+   |    |    +-------+  |                  |   +---------+ |
+----------------+    +---------------+                  +---------------+

notes:

  1. Many keyspaces (keyspace per user)
  2. Avoids adding a "user_id" value to each row (I can use the keyspace name as the user id)
  3. Very few tables per keyspace (in this example only 1 table per keyspace)

Option 3:

  1. Create a global keyspace
  2. Create a table per user_id (with the mandatory columns as well as that user's additional columns)

+---------------------------------------------------------------+
|                            Keyspace                           |
|                                                               |
|       user_1        user_2                         user_n     |
|    +--+---+--+   +--+--+--+                      +--+--+--+   |
|    |  |   |  |   |  |  |  |                      |  |  |  |   |
|    |  |   |  |   |  |  |  |                      |  |  |  |   |
|    |  |   |  |   |  |  |  |                      |  |  |  |   |
|    |  |   |  |   |  |  |  |                      |  |  |  |   |
|    |  |   |  |   |  |  |  |                      |  |  |  |   |
|    +--+---+--+   +--+--+--+                      +--+--+--+   |
|                                                               |
|                                                               |
+---------------------------------------------------------------+

Notes

  1. Global keyspace
  2. A table per user_id ("many" tables)
  3. Avoids duplicating the user id per row

Option 4: (Does this make sense?)

Create multiple keyspaces (say, "x" keyspaces), each holding a range of tables (one table per user)

                      keyspace_1                                                                                keyspace_x
+---------------------------------------------------------------+                         +---------------------------------------------------------------+
|                                                               |                         |                                                               |
|                                                               |                         |                                                               |
|       user_1        user_2                        user_n/x    |                         |     user_n-x      user_n-x+1                       user_n     |
|    +--+---+--+   +--+--+--+                      +--+--+--+   |                         |    +--+------+   +--+--+--+                      +--+--+--+   |
|    |  |   |  |   |  |  |  |                      |  |  |  |   |        "X" keyspaces    |    |  |   |  |   |  |  |  |                      |  |  |  |   |
|    |  |   |  |   |  |  |  |                      |  |  |  |   | +---------------------> |    |  |   |  |   |  |  |  |                      |  |  |  |   |
|    |  |   |  |   |  |  |  |                      |  |  |  |   |                         |    |  |   |  |   |  |  |  |                      |  |  |  |   |
|    |  |   |  |   |  |  |  |                      |  |  |  |   |                         |    |  |   |  |   |  |  |  |                      |  |  |  |   |
|    |  |   |  |   |  |  |  |                      |  |  |  |   |                         |    |  |   |  |   |  |  |  |                      |  |  |  |   |
|    +--+---+--+   +--+--+--+                      +--+--+--+   |                         |    +--+---+--+   +--+--+--+                      +--+--+--+   |
|                                                               |                         |                                                               |
|                                                               |                         |                                                               |
+---------------------------------------------------------------+                         +---------------------------------------------------------------+

Notes:

  1. Multiple keyspaces
  2. Multiple tables per keyspace (one table per user)
  3. Requires a "lookup" to figure out which keyspace contains the required table

Option 5:

Split data to multiple tables and multiple keyspaces

Notes:

  1. Requires "joining" information from multiple tables in some cases
  2. Seems to be more complicated


General notes for all scenarios:

  1. There are an order of magnitude fewer writes than reads
  2. Many millions of reads per day
  3. Traffic fluctuates per user_id - some user_ids have a lot of traffic and some have much less. We would need to tune per this metric
  4. Some user_ids are updated (writes) more frequently than others
  5. We have multiple data centers across geographies that need to stay in sync
  6. There is a long tail per primary key (some keys are accessed many times while other keys are rarely accessed)
I'm new to Cassandra myself, but option 1 makes the most sense to me. Cassandra is built for sparse columns. Also, have a look at composite primary keys - PRIMARY KEY (key_part_one, key_part_two); see stackoverflow.com/a/24953331/1277048. This offers some flexibility in retrieval over the concatenation approach: you can read ALL the rows with a given key_part_one in one request, or just the row that matches (key_part_one, key_part_two). - FuzzyAmi
List all your select queries, then design your data model according to those queries. - Ashraful Islam

2 Answers

4
votes

This type of integration challenge is usually solved by an EAV (Entity-Attribute-Value) data model in relational systems (like the one Ashraful demonstrates). The key consideration with an EAV model is an unbounded number of columns. An EAV data model can, of course, be mimicked in a CQL system like Cassandra or ScyllaDB. The EAV model lends itself nicely to writes but presents challenges when reading. You haven't really detailed your read requirements. Do you need all columns back, or do you need specific columns back per user?

Files

Having said that, there are some further considerations inherent to Cassandra and ScyllaDB that may point you towards a unified EAV model over some of the designs you describe in your question. Both Cassandra and ScyllaDB lay out keyspaces and tables as files on disk. The number of files is basically the product of the number of keyspaces and the number of tables, so the more keyspaces, tables, or combinations of the two you have, the more files you'll have on disk. This may become an issue with file descriptors and other OS file-handling limits. Due to the long tail of access you mentioned, it may be the case that every file is open all the time. That is not desirable, especially when starting from a cold boot.

[edit for clarity] All things being equal, one keyspace/table will always produce fewer files than many keyspaces/tables. This has nothing to do with the amount of data stored or the compaction strategy.

Wide Rows

But getting back to the data model: Ashraful's model has a partition key (userid) and a clustering key (key -> column1). Given the number of "entries" in each user file (500K-2M), and assuming each entry is a row of ~60 columns on average, you're basically creating 500K-2M rows of ~60 columns each per partition key, i.e. very large partitions. Cassandra and Scylla generally don't like very large partitions. Can they handle large partitions? Sure. Do large partitions impact performance in practice? Yes.

Updates or versioning

You mention updates. The base EAV model will only represent the most recent update. There is no versioning. What you could do is add time as a clustering key to ensure that you maintain the historical values of your columns over time.
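
A sketch of that versioning idea, adding a hypothetical updated_at clustering column to the EAV table:

```sql
-- Versioned EAV sketch (table and column names are assumptions).
-- updated_at is clustered in descending order, so the most recent
-- value of each column sorts first within its partition.
CREATE TABLE data_versioned (
    userid bigint,
    column1 text,
    column text,
    updated_at timestamp,
    value text,
    PRIMARY KEY ( (userid, column1), column, updated_at )
) WITH CLUSTERING ORDER BY (column ASC, updated_at DESC);
```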

Reads

If you want all the columns back, you could just serialize everything into a JSON object and put it in a single column, but I imagine that's not what you want. In the primary-key (partition key) model of a key/value system like Cassandra and Scylla, you need to know all the components of the key to get your data back. If you put column1, the unique row identifier, into your primary key, you will need to know it in advance; likewise any other column names that go into the primary key.

Partitions and Composite Partition Keys

Number of partitions dictates the parallelism of your cluster. The total number of partitions - the cardinality of partitions in your total corpus - has an effect on the utilization of your cluster hardware. More partitions = better parallelism and higher resource utilization.

What I might do here is modify the PRIMARY KEY to include column1. Then I would use column as a clustering key (which dictates not only uniqueness within a partition but also sort order - so consider this in your column naming conventions).

In the following table definition you would need to provide the userid and column1 as equalities in your WHERE clause.

CREATE TABLE data (
    userid bigint,
    column1 text,
    column text,
    value text,
    PRIMARY KEY ( (userid, column1), column )
);
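
With that table, a read supplies both partition-key components as equalities and gets back every attribute of one entry; for example:

```sql
-- Fetch all (column, value) pairs for one entry of one user.
SELECT column, value
FROM data
WHERE userid = 1 AND column1 = 'user_id_1_key_1';
```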

I'd also have a separate table, maybe columns_per_user, that records all columns for each userid. Something like

CREATE TABLE columns_per_user (
    userid bigint,
    max_columns int,
    column_names text,
    PRIMARY KEY ( userid )
);

Where max_columns is the total number of columns for this user and column_names holds the actual column names. You might also add a column for the total number of entries per user, something like user_entries int, which would basically be the number of rows in each user's csv file.
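
Usage of such a lookup table might look like the following (all values are hypothetical):

```sql
-- Record the column inventory for user 1, then consult it before reads.
INSERT INTO columns_per_user (userid, max_columns, column_names)
VALUES (1, 22, 'column1,column2,additionalColumn1');

SELECT max_columns, column_names
FROM columns_per_user
WHERE userid = 1;
```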

0
votes

Try the schema below:

CREATE TABLE data (
    userid bigint,
    key text,
    column text,
    value text,
    PRIMARY KEY (userid, key, column)
);

(Note that column must be part of the primary key: with PRIMARY KEY (userid, key) alone, the example inserts below that share the same key would overwrite each other.)

Here

userid  -> userid
key     -> column1
column  -> column name from column2
value   -> column value

Example inserts for the data below:

| column1 (unique key per user_id)  |  column2      |  column3        |
|-----------------------------------|---------------|-----------------|
| key_1                             |  value12      |  value13        | 
| key_2                             |  value22      |  value23        |

Insert Statement:

INSERT INTO data (userid, key, column, value) VALUES (1, 'key_1', 'column2', 'value12');
INSERT INTO data (userid, key, column, value) VALUES (1, 'key_1', 'column3', 'value13');
INSERT INTO data (userid, key, column, value) VALUES (1, 'key_2', 'column2', 'value22');
INSERT INTO data (userid, key, column, value) VALUES (1, 'key_2', 'column3', 'value23');
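
To read one entry back, query by userid and key; for example:

```sql
-- All column/value pairs for key_1 of user 1.
SELECT column, value
FROM data
WHERE userid = 1 AND key = 'key_1';
```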