1
votes

If I want to be able to query a table in Cassandra quickly and extract the largest values from it, how should one design the column family for it?

For example, if I have a table with peoples ID as and I also store for each person their speed, how I design the table such that I can query the slowest people as fast as I can?

Do we need to make speed a primary key? Or create an index on it or something?

How do I know when to choose secondary index over primary keys, in this example?

Is it possible to extract, say top k people that are the slowest by just making the speed a secondary index?

I guess my question also tries to address how things are sorted in Casssandra.

1

1 Answers

2
votes

In the post-relational (NoSQL) database world, normalization is not the standard the way it is in the relational world, so don't be afraid to write the same data in multiple places.

So in this case, if you want to be able to find the highest value fast, make a separate table that holds the highest value (or k values) and update it whenever you're updating the existing table with a value that's larger.

Optimize your tables for queries, and adjust what you write as needed.

As for the key - I'm picturing perhaps a table/cf with a single row with a key value like "slowestusers" and a value or series of values on that row.

CREATE TABLE slow_responses (
    id varchar PRIMARY KEY,
    slowest_response1 bigint,
    slowest_response2 bigint
);

INSERT INTO slow_responses (id, slowest_response1) VALUES ('slowestusers', 1200);
INSERT INTO slow_responses (id, slowest_response2) VALUES ('slowestusers', 1301);

There would be issues if you have more than one server updating these values, and that would have to be considered if you require it.

Edit:

Further thoughts on how to do a multi-server update on this. One is that you could serialize these updates by throwing them in a queue using your favorite q'ing implementation. I thought of another way, then realized it won't work. ;) But surely, there are options.

Edit:

Okay, maybe it will work. The other idea is to have each server write its own row. Then, on read, you pull in all the rows, sort all the numbers together, and pull the k highest values.

CREATE TABLE slow_responses (
    id varchar,
    server varchar,
    slowest_response1 bigint,
    slowest_response2 bigint,
    PRIMARY KEY (id, server)
);

-- server1 writes like:
INSERT INTO slow_responses (id, server, slowest_response1) VALUES ('slowestusers', 'node1', 1200);
INSERT INTO slow_responses (id, server, slowest_response2) VALUES ('slowestusers', 'node1', 1301);

-- server2 writes like:
INSERT INTO slow_responses (id, server, slowest_response1) VALUES ('slowestusers', 'node2', 800);
INSERT INTO slow_responses (id, server, slowest_response2) VALUES ('slowestusers', 'node2', 765);