0
votes

Here is a simple example of the user table in cassandra. What is best strategy to create a primary key.

My requirements are

  1. search by uuid
  2. search by username
  3. search by email

All the keys mentioned will be high cardinality keys. Also at any moment I will be having only one of them to search

PRIMARY KEY(uid,username,email)

What if I have only the username ?, Then the above primary key is not use ful. I am not able visualize a solution to achieve this using compound primary key?

what are other options? should we go with a new table with username to uid, then search the user table. ?

From all articles out there on the internet recommends not to create secondary index for high cardinality keys

CREATE TABLE medicscity.user (
    uid uuid,
    fname text,
    lname text,
    user_id text,
    email_id text,
    password text,
    city text,
    state_id int,
    country_id int,
    dob timestamp,
    zipcode text,
    PRIMARY KEY (??)
) 

How do we solve this kind of situation ?

2

2 Answers

1
votes

Yes, you need to go with duplicate tables.

If ever in Cassandra you face a situation in which you will have to query a table based on column1, column2 or column3 independently. You will have to duplicate the tables.

Now, how much duplication you have to use, is individual choice.

Like, in this example, you can either duplicate table with full data. Or, you can simply create a new table column1 (partition), column2, column 3 as primary key in main table. Create a new table with primary key of column1, column2, column3 and partition key on column2. Another one with same primary key and partition key on column3.

So, your data duplicate will be row, but in this case you will end up querying data twice. One from duplicate table, and one from full fledged table.

Big data technology, is there to speed up computation and let your system scale horizontally, and it comes at the expense of disk/storage. I mean just look at everything, even its base of replication factor does duplication of data.

1
votes

Your PRIMARY KEY(uuid,username,email) don't fit your requirement. Because you can't search for the clustering column without fill the Partition Key, and even the second clustering column without fill the first clustering column.

e.g. you cannot search for username without uuid in WHERE clause and cannot search for email without uuid and username too.

All you need is the denormalization and duplicate data.

Denormalization and duplication of data is a fact of life with Cassandra. Don’t be afraid of it. Disk space is generally the cheapest resource (compared to CPU, memory, disk IOPs, or network), and Cassandra is architected around that fact. In order to get the most efficient reads, you often need to duplicate data.

In your case, you need to create 3 tables that have the same column (data that you want to get), but these 3 tables will have different PRIMARY KEY, one have uuid as PK, one have username as PK, and one have email as PK. :)