0
votes

We are trying to build a data-model for Cassandra and there is an option to save data with kind of path view instead of classic table/column view.

For example

CREATE TABLE data1 (
  user uuid,
  timestamp timeuuid,
  column1 int,
  column2 int
  primary key (user, timestamp)
)
CREATE TABLE `data2` (
  user uuid,
  timestamp timeuuid,
  column1 int,
  column2 int
  primary key (user, timestamp)
)

becomes one table like:

create table all_data (
  user uuid,
  timestamp timeuuid,
  path text,
  value blob
  primary key (user, timestamp, path)
)

where we insert data as

insert into all_data (user, timestamp, path, value) values (1,0,'data1.column1',1)
insert into all_data (user, timestamp, path, value) values (1,0,'data1.column2',2)
insert into all_data (user, timestamp, path, value) values (1,9,'data2.column1',7)
insert into all_data (user, timestamp, path, value) values (1,9,'data2.column1',8)

It's somewhat similar to the way graphite collects statsd metrics and arguments to such model are to have kind of column based storage with better read performance for one column, since we almost never need to get whole row, but rather one column per user within time-range.

Does anyone have an experience with such data model and does it really have a better performance over classic one? Can you give any advice on this?

1
How can we answer about your data modelling when you haven't described the data that you're trying to model?OrangeDog

1 Answers

0
votes

Depends on what data you're after.
If you ever want to get all the columns for [user, timestamp] then you'll want to use the first model. One query (WHERE user=1 AND timestamp=0) will look at one node to retrieve all columns.
To use the second model, you'll have to execute a query for every possible combination of user/timestamp/path. If only column1 & column2 exist, the you would need 2 queries to retrieve them (WHERE user=1 AND timestamp=0 AND path='data1.column1') (WHERE user=1 AND timestamp=0 AND path='data1.column2'). Every row would potentially be on a different node.
The second model would be less performant.