0
votes

We are trying to store lots of attributes for a particular profile_id inside a table (using CQL3) and cannot wrap our heads around which approach is the best:

a. create table mytable (profile_id, a1 int, a2 int, a3 int, a4 int ... a3000 int) primary key (profile_id);

OR

b. create MANY tables, eg. create table mytable_a1(profile_id, value int) primary key (profile_id); create table mytable_a2(profile_id, value int) primary key (profile_id); ... create table mytable_a3000(profile_id, value int) primary key (profile_id);

OR

c. create table mytable (profile_id, a_all text) primary key (profile_id); and just store 3000 "columns" inside a_all, like: insert into mytable (profile_id, a_all) values (1, "a1:1,a2:5,a3:55, .... a3000:5");

OR

d. none of the above

The type of query we would be running on this table: select * from mytable where profile_id in (1,2,3,4,5423,44)

We tried the first approach and the queries keep timing out and sometimes even kill cassandra nodes.

1

1 Answers

2
votes

The answer would be to use a clustering column. A clustering column allows you to create dynamic columns that you could use to hold the attribute name (col name) and it's value (col value).

The table would be

create table mytable ( 
    profile_id text,
    attr_name text,
    attr_value int,
    PRIMARY KEY(profile_id, attr_name)
)

This allows you to add inserts like

insert into mytable (profile_id, attr_name, attr_value) values ('131', 'a1', 3);
insert into mytable (profile_id, attr_name, attr_value) values ('131', 'a2', 1031);
.....
insert into mytable (profile_id, attr_name, attr_value) values ('131', 'an', 2);

This would be the optimal solution.

Because you then want to do the following 'The type of query we would be running on this table: select * from mytable where profile_id in (1,2,3,4,5423,44)'

This would require 6 queries under the hood but cassandra should be able to do this in no time especially if you have a multi node cluster.

Also if you use the DataStax Java Driver you can run this requests asynchronously and concurrently on your cluster.

For more on data modelling and the DataStax Java Driver check out DataStax's free online training. Its worth a look http://www.datastax.com/what-we-offer/products-services/training/virtual-training

Hope it helps.