1
votes

I am trying to model a cassandra data set for a time series where I capture the sum of bytes over a minute for a given user, using a counter column. Using CQL3 I have this create table syntax:

CREATE TABLE minute_usr (
min varchar,
usr varchar,
bytes counter,
PRIMARY KEY (min, usr)
)

I am able to update the column using this query:

UPDATE minute_usr SET bytes = bytes + 200 WHERE usr = 'testuser' AND min = '1369448220';

however I would like to now get back the value of 'bytes' for a range of minutes where usr = 'username':

select min, bytes from minute_usr WHERE usr = 'testuser' AND min >= '1369448160' and min <= '1369448220';

and I am getting back the error:

Bad Request: Only EQ and IN relation are supported on the partition key for random partitioners (unless you use the token() function)

I was under the assumption that CQL was creating a composite key out of 'min' and 'usr' when creating the table allowing me to get a range of composite keys. The cli is actually telling me that the key is actually just the value of 'min':

[default@data_use] list minute_usr; 
Using default limit of 100 
Using default column limit of 100  
RowKey: 1369448220
=> (column=testuser:, value=, timestamp=1371066133370000)
=> (counter=testuser:bytes, value=1400)

 RowKey: 1369448160
=> (column=testuser:, value=, timestamp=1371066138506000)
=> (counter=testuser:bytes, value=1600)

Am I simply limited to doing an individual query for each timestamp / username combination with this data model?

I have also tried modeling the data with the username as the first part of the key, which allows me to do range queries

CREATE TABLE usr_minute (
min varchar,
usr varchar,
bytes counter,
PRIMARY KEY (usr, min)
)

And then I could do a range query on the column names for minutes like so:

`select bytes from usr_minute WHERE usr = 'testuser' AND min >= '1369448160' and min <= '1369448220';`

however I know that these values are now being stored in a single row which has limitations in terms of number of columns that can be stored, and I would like to let this data set grow forever.

Not sure how to proceed.

2

2 Answers

4
votes

If you want a composite partition key, you need extra brackets:

CREATE TABLE minute_usr (
  min varchar,
  usr varchar,
  bytes counter,
  PRIMARY KEY ((min, usr))
);

However, then you can't do range queries on min or usr. You can only do range queries on non-partition keys.

You can use your second model, with usr as the partition key, and shard on some time bucket to stop the rows growing too large. For example, you could have a partition per user per day:

CREATE TABLE usr_day_minute (
  day varchar,
  min varchar,
  usr varchar,
  bytes counter,
  PRIMARY KEY ((usr, day), min)
);

but now you will need to do separate queries per day if you want results for multiple days. You could choose a larger time bucket, but at the cost of larger rows.

Note you probably want to use a numeric data type or timestamp for min so you do numeric comparisons rather than string comparison.

2
votes

I don't think you really need to worry about row size. You can grow a single partition to 2 billion cells, so you are probably okay.

If you are really expecting data distribution to be bad, just insert an enum that is populated at random.

create table network_usage (
    usr varchar,
    bucket_enum int,
    when timestamp,
    bytes counter,
    PRIMARY KEY ((usr, bucket), when)
);

Now, you could have bucket_enum be derived from some point in time in timestamp, but that means that for certain windows of time, all the data will be in one node, which I'd imagine would be a concern. Let the bucket_enum be sized based on roughly how much you want the data sharded up. You can potentially just use the number of partitions in the database (and query at runtime for it).

Then, to update consumption information you might do something like: #language of your choice ts = now() bucket = random_integer() % sharding_factor;

#now in CQL
consistency any;
update network_usage set bytes = bytes + 200
    where usr = 'testuser' and bucket_enum = :bucket and when = now();

The bucket is really just an abuse of CQL to have the primary key sharded across the nodes in the database. Now we exploit that at query time. Let's say the sharding factor is 6:

#you may want a different consistency level, but since this is mostly historical data,
#one should really be enough.
consistency one;
select count from network_usage
where usr = 'testuser' AND
    bucket_enum in (0, 1, 2, 3, 4, 5) AND
    when >= :start_time and
    when < :end_time;

The different from the other approach is that you can control exactly how much data distribution there is, you can easily rebalance the data whenever you want, you don't have a hot spot partition, and you are spreading the data and the query processing load over as many nodes in the cluster as you want. The disadvantage is you are spreading the data and query processing load over as many nodes in the cluster as you want. ;-)

This approach is particularly helpful if you do your queries as Hadoop/Spark jobs, and it allows for completely flexible time resolution.