Hello, we have a table in Cassandra whose structure is as below:

CREATE TABLE dmp.user_profiles_6 (
    vuid text PRIMARY KEY,
    brand_model text,
    first_seen timestamp,
    last_seen timestamp,
    total_day_count int,
    total_usage_count int,
    user_type text
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.1
    AND speculative_retry = '99PERCENTILE';

I read a few articles about data modeling in Cassandra from DataStax. They said that the primary key consists of a partition key and clustering keys.

In the above case, the vuid column is an identifier for every unique user, and it is the primary key. We have 400M unique users. Does that mean Cassandra is creating 400M partitions? That must degrade performance. Yet one DataStax article about data modeling shows an example table whose primary key is a uuid column, which is unique and has very high cardinality. I am totally confused. Can anyone help me identify which column should be the partition key and which the clustering key?

Queries can be as below:
1. Select a record directly on the basis of vuid
2. Select vuids on the basis of a range of last_seen or first_seen

3 Answers

1
votes
  1. Select record directly on basis of vuid >> Your table already does that: vuid is the primary key, so it is also the partition key.
  2. Select vuids on basis of range of last seen or first seen >>
    There are two options here. Either add last_seen or first_seen as clustering columns (without ALLOW FILTERING, range selection works only on clustering columns).
    In this case you need to provide vuid along with last_seen and first_seen in the query. I don't think you want that.
    OR
    Create another table that holds the same data. (Yes, in C* we create a separate table per query, with the same data and with keys chosen for that query. Welcome to data duplication.) In this table you add a dummy column as the partition key and make last_seen and first_seen clustering keys. You then pass the seen dates in the query to fetch vuid.
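The second option could look roughly like the sketch below. The table name `user_profiles_by_last_seen` and the `day_bucket` column are assumptions made for illustration; `day_bucket` plays the role of the "dummy" partition key and is chosen here as one value per day so that 400M users do not all land in a single partition. `vuid` is added as a final clustering column so that two users seen at the same instant do not overwrite each other.

```cql
-- Hypothetical lookup table: one partition per day, rows ordered by last_seen.
CREATE TABLE dmp.user_profiles_by_last_seen (
    day_bucket text,        -- assumed "dummy" partition key, e.g. '2017-06-01'
    last_seen  timestamp,   -- clustering column: allows range predicates
    vuid       text,        -- clustering column: keeps rows unique
    PRIMARY KEY ((day_bucket), last_seen, vuid)
);

-- Range query: the partition key is fixed, the clustering column is ranged.
SELECT vuid
FROM dmp.user_profiles_by_last_seen
WHERE day_bucket = '2017-06-01'
  AND last_seen >= '2017-06-01 00:00:00+0000'
  AND last_seen <  '2017-06-01 12:00:00+0000';
```

Because last_seen is part of the primary key here, updating a user's last_seen means deleting the old row and inserting a new one; the application has to keep this table in sync with the main one.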

Hope this is clear.

1
votes
You need to create 3 tables as below.

table 1:-
CREATE TABLE dmp.user_profiles_ZZZZ (
    Dummy_column  uuid ,
    vuid text,
    ........other columns
    PRIMARY KEY((Dummy_column,vuid))
) .....

table 2:-
CREATE TABLE dmp.user_profiles_YYYY (
    Dummy_column  uuid ,
    .......other columns
    PRIMARY KEY((Dummy_column),first_seen)
) .....

table 3:-
CREATE TABLE dmp.user_profiles_XXXX (
    Dummy_column  uuid ,
    .....other columns
    PRIMARY KEY((Dummy_column),last_seen)
) .....
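
As a rough sketch of how these tables would be queried (assuming vuid is among the elided "other columns" of tables 2 and 3, and using `?` bind markers as in a prepared statement):

```cql
-- Table 1: (Dummy_column, vuid) is a composite partition key,
-- so both parts must be supplied with equality.
SELECT * FROM dmp.user_profiles_ZZZZ
WHERE Dummy_column = ? AND vuid = ?;

-- Table 2: partition key by equality, first_seen (clustering column) by range.
SELECT vuid FROM dmp.user_profiles_YYYY
WHERE Dummy_column = ? AND first_seen >= ? AND first_seen < ?;

-- Table 3: same pattern, ranged on last_seen.
SELECT vuid FROM dmp.user_profiles_XXXX
WHERE Dummy_column = ? AND last_seen >= ? AND last_seen < ?;
```

Note that in every case the caller must already know the Dummy_column value for the partition being read.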
0
votes

In Cassandra (a query-driven model), tables are created to satisfy queries. This is different from relational database data modeling.

In Cassandra, the primary key consists of 2 types of keys:

1. Partition key -> defines the partitions

2. Clustering key -> defines the order within a partition

Which columns go where depends on the use case.

If the columns in the partition key and clustering key are not enough to guarantee uniqueness, then we need to add the primary key of the relation (the entity's unique identifier) to the Cassandra primary key as well.
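
A minimal sketch of the uniqueness point, using a hypothetical table `seen_by_type` (not part of the original schema). Since an INSERT in Cassandra is an upsert, two rows with the same primary key collapse into one; adding vuid as a last clustering column keeps them apart.

```cql
-- Hypothetical table for illustration only.
CREATE TABLE dmp.seen_by_type (
    user_type text,
    last_seen timestamp,
    vuid      text,
    PRIMARY KEY ((user_type), last_seen, vuid)  -- vuid added only for uniqueness
);

-- Same user_type and last_seen, different vuid: both rows are kept.
-- With PRIMARY KEY ((user_type), last_seen) the second insert
-- would overwrite the first.
INSERT INTO dmp.seen_by_type (user_type, last_seen, vuid)
VALUES ('mobile', '2017-06-01 10:00:00+0000', 'user-a');
INSERT INTO dmp.seen_by_type (user_type, last_seen, vuid)
VALUES ('mobile', '2017-06-01 10:00:00+0000', 'user-b');
```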

Apart from that, as a tip:

[Column name XX] = ? -> equality check, so add the column to the partition key

[Column name yy] >= ? -> range check, so add the column to the clustering key
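
Applied to a made-up query (the table name `users_by_brand_model`, the query itself, and the sample values are assumptions for illustration; the column names come from the table in the question), the tip could work out like this:

```cql
-- Query to serve (hypothetical):
--   brand_model = ?                        -> equality -> partition key
--   first_seen >= ? AND first_seen < ?     -> range    -> clustering key
CREATE TABLE dmp.users_by_brand_model (
    brand_model text,
    first_seen  timestamp,
    vuid        text,       -- appended so that rows stay unique
    PRIMARY KEY ((brand_model), first_seen, vuid)
);

SELECT vuid
FROM dmp.users_by_brand_model
WHERE brand_model = 'Pixel 2'
  AND first_seen >= '2017-01-01'
  AND first_seen <  '2017-02-01';
```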

The question does not make clear which query needs to be served. Please share the query; the table can then be designed based on it.