3
votes

Following pointers from an eBay tech blog and a DataStax developers blog, I am modeling some event log data in Cassandra 1.2. As the partition key, I use "yymmddhh|bucket", where bucket is any number between 0 and the number of nodes in the cluster.

The data model

cqlsh:Log> CREATE TABLE transactions (yymmddhh varchar, bucket int, rId int, created timeuuid, data map<varchar, varchar>, PRIMARY KEY((yymmddhh, bucket), created) );

(rId identifies the resource that fired the event. The map holds key-value pairs derived from a JSON document; the keys change, but not much.)

I assume that this translates into a composite primary/row key with X buckets per hour. My column names are then timeuuids. Querying this data model works as expected (I can query time ranges).
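For illustration, an insert and a time-range query against this table look roughly like the following (the literal values are just placeholders; I rely on the CQL3 timeuuid functions now() and minTimeuuid()):

INSERT INTO transactions (yymmddhh, bucket, rId, created, data)
VALUES ('13031615', 1, 42, now(), {'type': 'click', 'source': 'web'});

-- query a time range within one hour/bucket partition
SELECT * FROM transactions
WHERE yymmddhh = '13031615' AND bucket = 1
  AND created >= minTimeuuid('2013-03-16 15:00+0000')
  AND created < minTimeuuid('2013-03-16 15:30+0000');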

The problem is performance: the time to insert a new row increases continuously. So I am doing something wrong, but I can't pinpoint the problem.

When I use the timeuuid as part of the row key, the performance remains stable at a high level, but this would prevent me from querying time ranges (a query without the full row key of course throws an error message about "filtering").

Any help? Thanks!

UPDATE

Switching from the map data type to predefined column names alleviates the problem. Insert times now seem to stay below 0.005 s per insert.
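For reference, the new table looks roughly like this (table and column names are placeholders standing in for my real ones):

CREATE TABLE transactions_flat (
  yymmddhh varchar,
  bucket int,
  created timeuuid,
  rId int,
  -- placeholder columns standing in for the (mostly stable) JSON keys
  eventtype varchar,
  source varchar,
  payload varchar,
  PRIMARY KEY ((yymmddhh, bucket), created)
);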

The core question remains: how is my usage of the "map" data type inefficient? And what would be an efficient way to do thousands of inserts with only slight variation in the keys?

The keys I put into the map mostly remain the same. I understood the DataStax documentation (can't post the link due to reputation limitations, sorry, but it's easy to find) to say that each key creates an additional column -- or does it create one new column per "map"? That would be hard for me to believe.

2
Interestingly enough, when I increase the range of the "bucket" value (0-299 instead of 0-2), this seems to improve the situation significantly. Where previously the performance decrease appeared after the first few hundred inserts, I have now inserted 20k+ rows without any visible problems. – John
Anyway, this does not fix the issue. As far as I can see right now, it just postpones it. Is the only solution to use (almost) unique row keys for this? Or do I have to look in a different direction (server config?)? – John
Be careful with the MAP column: if you insert more than 65K items into it you won't be able to read them all back. You probably want to model it differently (which shouldn't be a problem since your only non-primary-key cell is the MAP field). You can probably use two VARCHAR cells instead, but you'd have to post some more info about how you use the table for me to be sure. – Theo
Theo, thanks for the pointer! The MAP column was the source of my troubles. I receive data as JSON and didn't want to store a JSON string in a C* cell, but rather convert it to a C*-native data type. The set of keys in my map changes, but not very often... How could using a MAP column result in this performance loss? I thought that internally Cassandra stores each map key as a new column? Could you answer below? I would like to award you the points! Thanks again. – John
I have no idea how using a MAP could cause performance problems; it shouldn't, as far as I know. I've added some kind of answer below. I don't think it is really the answer to your question, but maybe it gives you some more insight into what's going on so that you can find the real answer. – Theo

2 Answers

2
votes

I suggest you model your rows a little differently. The collections aren't a good fit for cases where you might end up with too many elements in them. The reason is a limitation in the Cassandra binary protocol, which uses two bytes to represent the number of elements in a collection. This means that if your collection has more than 2^16 elements, the size field will overflow, and even though the server sends all of the elements back to the client, the client only sees the first N % 2^16 elements (so if you have 2^16 + 3 elements it will look to the client as if there are only 3).

If there is no risk of getting that many elements into your collections, you can ignore this advice. I would not expect collections to give you worse performance; I'm not really sure how that would happen.

CQL3 collections are basically just a hack on top of the storage model (and I don't mean "hack" in any negative sense); you can build a MAP-like row yourself that is not constrained by the above limitation:

CREATE TABLE transactions (
  yymmddhh VARCHAR,
  bucket INT,
  created TIMEUUID,
  rId INT,
  key VARCHAR,
  value VARCHAR,
  PRIMARY KEY ((yymmddhh, bucket), created, rId, key)
);

(Notice that I moved rId and the map key into the primary key. I don't know what rId is, but I assume that this would be correct.)

This has two drawbacks over using a MAP: it requires you to reassemble the map when you query the data (you would get back a row per map entry), and it uses a little more space since C* will insert a few extra columns. But the upside is that there is no problem with collections growing too big.
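For example (all values here are placeholders), you would write one INSERT per map entry and get the map back as one row per entry:

-- the client generates a single timeuuid per event and reuses it for every key of that event
INSERT INTO transactions (yymmddhh, bucket, created, rId, key, value)
VALUES ('13031615', 17, 4d6bd570-8e4e-11e2-9e96-0800200c9a66, 42, 'type', 'click');

INSERT INTO transactions (yymmddhh, bucket, created, rId, key, value)
VALUES ('13031615', 17, 4d6bd570-8e4e-11e2-9e96-0800200c9a66, 42, 'source', 'web');

-- reading the partition returns one row per map entry, which the client reassembles
SELECT created, rId, key, value FROM transactions
WHERE yymmddhh = '13031615' AND bucket = 17;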

In the end it depends a lot on how you want to query your data. Don't optimize for insertions, optimize for reads. For example: if you don't need to read back the whole map every time, but usually just read one or two keys from it, put the key in the partition/row key instead and have a separate partition/row per key (this assumes that the set of keys will be fixed so you know what to query for, so as I said: it depends a lot on how you want to query your data).
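A sketch of that per-key variant, assuming you know the keys up front (names and values are again just placeholders):

CREATE TABLE transactions_by_key (
  yymmddhh VARCHAR,
  bucket INT,
  key VARCHAR,
  created TIMEUUID,
  rId INT,
  value VARCHAR,
  PRIMARY KEY ((yymmddhh, bucket, key), created, rId)
);

-- fetch only the one key you care about instead of whole maps
SELECT created, rId, value FROM transactions_by_key
WHERE yymmddhh = '13031615' AND bucket = 17 AND key = 'type';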

You also mentioned in a comment that the performance improved when you increased the number of buckets from three (0-2) to 300 (0-299). The reason for this is that you spread the load much more evenly throughout the cluster. When you have a partition/row key that is based on time, like your yymmddhh, there will always be a hot partition where all writes go (it moves throughout the day, but at any given moment it will hit only one node). You correctly added a smoothing factor with the bucket column/cell, but with only three values the likelihood of at least two ending up on the same physical node is too high. With three hundred you will have a much better spread.

0
votes

Use yymmddhh as the row key and bucket+timeUUID as the column name, where each bucket has 20 (or some fixed number of) records; the buckets can be managed using a counter column family.
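A rough sketch of the counter part, assuming CQL3 (table and column names are placeholders):

CREATE TABLE bucket_counters (
  yymmddhh varchar PRIMARY KEY,
  records counter
);

-- bump the counter on every insert; once it passes the bucket size (e.g. 20),
-- the client starts writing into the next bucket
UPDATE bucket_counters SET records = records + 1 WHERE yymmddhh = '13031615';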