1
votes

This is more of a design question.

Let's imagine I have this table (most columns omitted for readability):

CREATE TABLE IF NOT EXISTS users (
userid uuid,
emailaddress text,
passwordhash text,
passwordsalt text,
datecreated timestamp,
PRIMARY KEY (userid, emailaddress)
);

On this table, I need to sometimes get userid by emailaddress and other times, simply use the userid to get the rest of the values, so the queries are:

  1. Search by emailaddress, get userid and use in other places
  2. Search by userid, get all the values for specific user and use that elsewhere.

Querying by only userid works since it's the first key in the composite key table but for querying by emailaddress (not knowing the userid in advance), I will have to turn on the ALLOW FILTERING which is strongly discouraged.

Question is: knowing this design is not good in terms of how Cassandra operates, how else would I achieve my goal (to get user details by EITHER userid OR emailaddress)? Should I create 2 tables such as these?

CREATE TABLE IF NOT EXISTS users (
userid uuid,
passwordhash text,
passwordsalt text,
datecreated timestamp,
PRIMARY KEY (userid)
);

CREATE TABLE IF NOT EXISTS useremails (
emailaddress text,
userid uuid,
PRIMARY KEY (emailaddress)
);

Or is there a cleaner way to achieve what I want to do? I apologize in advance for my lack of understanding the ins and outs of Cassandra. I'm still learning it and trying to approach the problems from the correct angle instead of hacking (which ALLOW FILTERING would allow me to do).

Many thanks,

2

2 Answers

2
votes

Cassandra stores data in partitions, and the first key in the primary key definition is the partition key. You want your queries to hit a single (or at least n number of) partition(s), instead of doing a cluster wide search (which is what allow filtering is allowing).

Your idea of having a second table is a good one. I'd go with that. Remember, two 1ms queries is better than one 2s query :)

1
votes

Your option of managing a second table with the index by yourself is a nice option if you expect a very high volume of users. If the expected number of users is somewhat lower (~10-50 million) you can use a secondary index too. This options are discussed in the Cassandra 1.1 documentation. Since you seems to be using Cassandra 2.0 you probably want to have a look at the Cassandra 2.0 documentation too:

When to Use Secondary Indexes

Cassandra's built-in secondary indexes are best on a column family having many rows that contain the indexed value. The more unique values that exist in a particular column, the more overhead you will have, on average, to query and maintain the index. For example, suppose you had a user table with a billion users and wanted to look up users by the state they lived in. Many users will share the same column value for state (such as CA, NY, TX, etc.). This would be a good candidate for a secondary index.

When Not to Use Secondary Indexes

Do not use secondary indexes to query a huge volume of records for a small number of results. For example, if you create indexes on columns that have many distinct values, a query between the fields will incur many seeks for very few results. In the column family with a billion users, looking up users by their email address (a value that is typically unique for each user) instead of by their state, is likely to be very inefficient. It would probably be more efficient to manually maintain a dynamic column family as a form of an index instead of using a secondary index. For columns containing unique data, it is sometimes fine performance-wise to use secondary indexes for convenience, as long as the query volume to the indexed column family is moderate and not under constant load.

Building and Using Secondary Indexes

An advantage of secondary indexes is the operational ease of populating and maintaining the index. Secondary indexes are built in the background automatically, without blocking reads or writes. Client-maintained column families as indexes must be created manually; for example, if the state column had been indexed by creating a column family such as users_by_state, your client application would have to populate the column family with data from the users column family.

As you can see, you have 2 options, use the secondary index to have it working right away without having to change too much code if the expected size of the table is rather small or go with the 2 tables setup if the application if too big or if you expect it to scale to that level in the future.

Always try to avoid at all cost the use of ALLOW FILTERING,

Hope it helps!