
I am testing CQL/Cassandra 1.2 and the python-cql library on a VM with 2 GB of RAM. I have a table with a compound primary key (wide row). When running queries against a single node I am getting about 10x worse performance than MySQL. Requests are serial with no concurrency, but I am interested in the speed of a single request.

  • Most importantly, is there anything I can do to optimize querying wide rows (specifically this query)?
  • Are these numbers reflective of Cassandra vs. MySQL performance in a single-request situation?
  • Could my limited RAM/VM be making this big a difference?
  • Would multi-node Cassandra / partitioned MySQL be closer than 10x?
  • Am I doing something horribly wrong?

Test code:

"""
CREATE TABLE foo_bars (
     foo_id text,
     bar_id bigint,
     content text,
     PRIMARY KEY (foo_id, bar_id)
)
WITH CLUSTERING ORDER BY (bar_id DESC);
"""

# content is up to 64 KB of text, and the number of bar columns in a foo row
# will be ever growing but will probably never exceed 2 million


import time

# random_foo_ids is assumed to be a comma-separated string of quoted foo_id values.

t1 = time.time()
for i in range(1000):
    sql_query = "SELECT * FROM foo_bars WHERE foo_id IN (%s) ORDER BY bar_id DESC LIMIT 40" % random_foo_ids
    result = db_cursor.execute(sql_query)
t2 = time.time()
print "Sql time = %s" % str(t2 - t1)


t1 = time.time()
for i in range(1000):
    cql_query = "SELECT * FROM foo_bars WHERE foo_id IN (%s) LIMIT 40" % random_foo_ids
    result = cassandra_cursor.execute(cql_query)
t2 = time.time()
print "Cql time = %s" % str(t2 - t1)

Sql time = 4.2
Cql time = 58.7

Thanks in advance!

How big is your column family? The 'Space used (live)' figure in the output of nodetool cfstats is the best indicator. - Richard
Space used (live): 31749778; Space used (total): 31749778; Compacted row minimum size: 447; Compacted row maximum size: 654949; Compacted row mean size: 68740 - user2537952
That's 31 MB, so it easily fits into cache. It can't be related to memory then. It might simply be that Cassandra's read latency is higher than MySQL's. Throughput may well be higher, but you'll need concurrency to see it. - Richard
Thanks, I was worried about that :( I was considering manipulating column_index_size_in_kb (currently set at 64) but I don't know what a good value would be for my use case. - user2537952
That shouldn't make any difference, since you are reading the first 40 columns from a row. The column index only helps when reading specific columns from a large row. - Richard

1 Answer


You might get it a bit faster by enabling the row cache. Set row_cache_size_in_mb in cassandra.yaml to something larger than your CF size - so 100 would work. Then set caching = 'all' for your column family. As you read, you should see the hit rate increase as reported by nodetool info.
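For concreteness, here is a minimal sketch of that setup, assuming the foo_bars table from the question (the yaml value is read at node startup, so a restart is needed):

# In cassandra.yaml:
row_cache_size_in_mb: 100

-- Then, in cqlsh (Cassandra 1.2 CQL 3 syntax):
ALTER TABLE foo_bars WITH caching = 'all';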

However, I doubt you will get anything like a 10x speed-up.

The problem is that Cassandra (reads in particular) is built for high throughput rather than low latency. There are lots of queues inside Cassandra that add to latency. Adding more nodes will increase latency slightly (although going much beyond the replication factor shouldn't increase it further), but gives an approximately linear improvement in throughput.

The solution is to use concurrency: either queues, threads, and multiple connections in your single client, or multiple clients. But if that's not possible for your use case, I expect MySQL will be faster for this kind of read. Indeed, if you are only expecting to have 31 MB of data, MySQL is probably better for your use case anyway. A sketch of the threaded approach follows.
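If it helps, here is a minimal sketch of the threaded-client variant using python-cql. The host, port, keyspace, thread count, and hard-coded foo_id lists are placeholder assumptions, not taken from the question:

import threading
import time

import cql  # python-cql driver


NUM_THREADS = 8
QUERIES_PER_THREAD = 125  # 8 * 125 = 1000 queries total


def worker(foo_id_lists):
    # One connection per thread: a Thrift-based connection
    # should not be shared across threads.
    conn = cql.connect('localhost', 9160, 'my_keyspace', cql_version='3.0.0')
    cursor = conn.cursor()
    for foo_id_list in foo_id_lists:
        cursor.execute(
            "SELECT * FROM foo_bars WHERE foo_id IN (%s) LIMIT 40" % foo_id_list)
    conn.close()


# Each work item is a comma-separated string of quoted foo_ids,
# prepared the same way as random_foo_ids in the question.
work = [["'foo1','foo2'"] * QUERIES_PER_THREAD for _ in range(NUM_THREADS)]

t1 = time.time()
threads = [threading.Thread(target=worker, args=(chunk,)) for chunk in work]
for t in threads:
    t.start()
for t in threads:
    t.join()
t2 = time.time()
print "Concurrent cql time = %s" % str(t2 - t1)

With this shape, total wall-clock time for the same 1000 queries should drop roughly with the number of threads until the node saturates, even though each individual request is no faster.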