I'm new to Cassandra and am looking for best practices on how to model data with the following general structure:
The data is "user" based (per customer) , each is supplying a big data file of around 500K-2M entries (periodically updated a few times a day - sometimes full update and sometimes only deltas)
Each data file has ~20 mandatory data fields, but each user can add additional columns at their discretion (up to ~100).
The additional data fields are NOT necessarily the same across users (neither the field names nor their types).
Example (CSV format):
user_id_1.csv
| column1 (unique key per user_id) | column2 | column3 | ... | column10 | additionalColumn1 | ...additionalColumn_n |
|-----------------------------------|-----------|----------|---------|------------|---------------------|------------------------|
| user_id_1_key_1 | value | value | value | value | ... | value |
| user_id_1_key_2 | .... | .... | .... | .... | ... | ... |
| .... | ... | ... | ... | ... | ... | ... |
| user_id_1_key_2Million | .... | .... | .... | .... | ... | ... |
user_id_XXX.csv (notice that the first 10 columns are identical to the other users', but the additional columns differ in both name and type)
| column1 (unique key per user_id) | column2 | column3 | ... | column10 | additionalColumn1 (different types than user_id_1 and others) | ...additional_column_x |
|-----------------------------------------------------------|-----------|----------|---------|------------|-----------------------------------------------------------------|-------------------------|
| user_id_XXX_key_1 | value | value | value | value | ... | value |
| user_id_XXX_key_2 | .... | .... | .... | .... | ... | ... |
| .... | ... | ... | ... | ... | ... | ... |
| user_id_XXX_key_500_thousand (fewer rows than the other users) | .... | .... | .... | .... | ... | ... |
Several options I have considered:
Option 1:
- Create a "global" keyspace
- Create a big table "data" containing everything
- Add a user_id column to the big table alongside all of the other columns (including the non-mandatory ones). The primary key becomes (user_id, column_1), where column_1 is unique per user_id.
Keyspace
+------------------------------------------------+
|                   Data_Table                   |
|  +--------+-------+------------------+-----+   |
|  |        |       |                  |     |   |
|  +--------+-------+------------------+-----+   |
|  |        |       |                  |     |   |
|  |  many rows     |   many columns   |     |   |
|  |    |           | ---------------> |     |   |
|  |    v           |                  |     |   |
|  +--------+-------+------------------+-----+   |
+------------------------------------------------+
A few things that I notice right away:
- The user_id is repeated once per entry for that user
- The rows are very sparse in the additional columns (mostly null values), since users don't necessarily share them
- The number of users is relatively small, so the total number of additional columns is not huge (~10K columns max)
- I could compact each user's additional-column data into a single "metadata" column (e.g., a map) shared across all users; see the sketch below
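A minimal CQL sketch of this option, under stated assumptions: the keyspace name `global_ks` is hypothetical, and the last note's idea is rendered as a `map<text, text>` column (values serialized to text, since a single map column cannot natively hold per-user types):

```
-- All names are illustrative; SimpleStrategy is used for brevity
-- (see the multi-datacenter note at the end for NetworkTopologyStrategy).
CREATE KEYSPACE IF NOT EXISTS global_ks
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

CREATE TABLE IF NOT EXISTS global_ks.data (
    user_id   text,
    column_1  text,                    -- unique per user_id
    column_2  text,
    -- ... column_3 through column_10 (mandatory) ...
    column_10 text,
    additional_data map<text, text>,   -- per-user extra fields as text key/value pairs
    PRIMARY KEY ((user_id), column_1)  -- partition by user, cluster by the unique key
);
```

One caveat with this layout: with user_id alone as the partition key, a user's entire 500K-2M entries land in one partition, so some bucketing in the partition key would likely be needed to keep partition sizes manageable.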
Option 2:
- Create a keyspace per user_id
- Create a table "data" in each keyspace
+-----------------------------------------------------------------------------------+
| column_1 | column_2 | ... | column_n | additional_column_1 | additional_column_n |
+-----------------------------------------------------------------------------------+
keyspace_user1 keyspace_user2 keyspace_user_n
+----------------+ +---------------+ +---------------+
| | | | | |
| | | | | |
| +-+-+--+-+ | | +-+--+--+ | | +--+--+---+ |
| | | | | | | | | | | | | many keyspaces | | | | | |
| | | | | | | | | | | | | +-------------> | | | | | |
| | | | | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | | | | |
| +--------+ | | +-------+ | | +---------+ |
+----------------+ +---------------+ +---------------+
Notes:
- Many keyspaces (one keyspace per user)
- Avoids storing a "user_id" value in each row (the keyspace name serves as the user id)
- Very few tables per keyspace (in this example, only one table per keyspace); see the sketch below
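A minimal sketch of one such per-user keyspace; the keyspace, column names, and types are illustrative and would be generated per user at onboarding:

```
CREATE KEYSPACE IF NOT EXISTS keyspace_user1
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

CREATE TABLE IF NOT EXISTS keyspace_user1.data (
    column_1 text PRIMARY KEY,   -- unique key; the user_id is implied by the keyspace name
    column_2 text,
    -- ... column_3 through column_10 (mandatory) ...
    additional_column_1 int,     -- this user's own extras, with their own types
    additional_column_2 text
);
```

With column_1 as the partition key, each user's rows spread evenly across the cluster; the trade-off is that replication settings and operational tooling now have to manage one keyspace per user.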
Option 3:
1) Create a global keyspace
2) Create a table per user_id (each table holds the mandatory columns plus that user's additional columns)
+---------------------------------------------------------------+
| Keyspace |
| |
| user_1 user_2 user_n |
| +--+---+--+ +--+--+--+ +--+--+--+ |
| | | | | | | | | | | | | |
| | | | | | | | | | | | | |
| | | | | | | | | | | | | |
| | | | | | | | | | | | | |
| | | | | | | | | | | | | |
| +--+---+--+ +--+--+--+ +--+--+--+ |
| |
| |
+---------------------------------------------------------------+
Notes:
- Global keyspace
- A table per user_id ("many" tables)
- Avoids duplicating the user_id per row; see the sketch below
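A minimal sketch with two hypothetical per-user tables, to show the additional columns differing per table in both name and type (the `global_ks` keyspace from the Option 1 sketch is assumed to exist):

```
CREATE TABLE IF NOT EXISTS global_ks.user_1_data (
    column_1 text PRIMARY KEY,
    column_2 text,
    -- ... remaining mandatory columns ...
    additional_column_1 timestamp    -- user_1's own extras
);

CREATE TABLE IF NOT EXISTS global_ks.user_xxx_data (
    column_1 text PRIMARY KEY,
    column_2 text,
    -- ... remaining mandatory columns ...
    additional_column_x double       -- user_XXX's extras differ in name and type
);
```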
Option 4: (Does this make sense?)
Create multiple keyspaces (for instance, "x" keyspaces), each holding a range of tables (one table per user)
keyspace_1 keyspace_x
+---------------------------------------------------------------+ +---------------------------------------------------------------+
| | | |
| | | |
| user_1 user_2 user_n/x | | user_n-x user_n-x+1 user_n |
| +--+---+--+ +--+--+--+ +--+--+--+ | | +--+------+ +--+--+--+ +--+--+--+ |
| | | | | | | | | | | | | | "X" keyspaces | | | | | | | | | | | | | |
| | | | | | | | | | | | | | +---------------------> | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | | | | | | | | | | | | | |
| +--+---+--+ +--+--+--+ +--+--+--+ | | +--+---+--+ +--+--+--+ +--+--+--+ |
| | | |
| | | |
+---------------------------------------------------------------+ +---------------------------------------------------------------+
Notes:
- Multiple keyspaces
- Multiple tables per keyspace (one table per user)
- Requires a "lookup" to figure out which keyspace contains the required table; see the sketch below
Option 5:
Split the data across multiple tables and multiple keyspaces.
Notes:
1. Requires "joining" information from multiple tables in some cases (see the sketch below)
2. Seems to be more complicated
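A minimal sketch of what such a split might look like: mandatory columns in one shared table, per-user extras in their own keyspace. Cassandra has no server-side joins, so the application merges the two result sets on column_1 (all names are illustrative, and both keyspaces are assumed to exist):

```
CREATE TABLE IF NOT EXISTS global_ks.mandatory_data (
    user_id  text,
    column_1 text,
    column_2 text,
    -- ... column_3 through column_10 ...
    PRIMARY KEY ((user_id), column_1)
);

CREATE TABLE IF NOT EXISTS extras_ks.user_1_extras (
    column_1 text PRIMARY KEY,
    additional_column_1 int
);

-- Two reads, merged by the application on column_1:
SELECT * FROM global_ks.mandatory_data WHERE user_id = 'user_1' AND column_1 = 'user_id_1_key_1';
SELECT * FROM extras_ks.user_1_extras  WHERE column_1 = 'user_id_1_key_1';
```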
General notes for all scenarios:
- There are an order of magnitude fewer writes than reads
- Many millions of reads per day
- Traffic fluctuates per user_id: some user_ids see a lot of traffic and others much less. Tuning would be needed per this metric
- Some user_ids are updated (writes) more frequently than others
- We have multiple data centers across geographies that need to stay in sync (see the replication sketch below)
- There is a long tail per primary key (some keys are accessed many times while other keys are rarely accessed)
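For the multi-datacenter point, the standard Cassandra approach is NetworkTopologyStrategy replication per keyspace, paired with a DC-local consistency level such as LOCAL_QUORUM in the driver; the datacenter names below are assumptions:

```
-- Replicate the keyspace to every datacenter; Cassandra ships writes to all
-- DCs and converges them via its usual repair mechanisms.
ALTER KEYSPACE global_ks
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us_east': 3,
    'eu_west': 3
  };
```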