Consider the problem of storing users and their contacts. There are about a 100 million users, each has a few hundred contacts and on an average contacts are 1kb in size. There may be some users with too many contacts (>5000) and there may be some contacts that are much (say 10x) bigger than the average of 1kb. Users actively add contacts and less often also delete them. Contacts are not pointers to other users but just a bundle of information.
There are two kinds of queries -
- Given a user and a contact name, lookup the contact details
- Given a user, look up all associated contact names
I was thinking of a contacts table like this -
CREATE TABLE contacts {
user_name text,
contact_name text,
contact_details map<text, text>,
PRIMARY KEY ( (user_name, contact_name) )
// ^ Notice the composite primary key
}
The choice of composite primary key is due to the number and size of contacts per user. I wanted one contact per row.
This table easily addresses the query of looking up a contact's details given a user and a contact name.
I'm looking for suggestions to address the second query.
Two options (with related concerns) on my mind -
- Create a second table called contact_names_by_user, with user_name as the partition key and contact_name as a clustering key. Concern: If there a user with way too many contacts (say 20k), would that result in a non-optimally wide row?
- Create an index on user_name. Concern: However given the ratio of total number of users (100M) to average contacts per user (say 200), would that value be considered to have high-cardinality, hence bad for indexing?
In general, are there guideline around looking up many items (like contacts here) referred by one item (like user here) without running in wide rows or non-optimal indexes?