Cassandra random read speed

Question

We're still evaluating Cassandra for our data store. As a very simple test, I inserted a value for 4 columns into the Keyspace1/Standard1 column family on my local machine amounting to about 100 bytes of data. Then I read it back as fast as I could by row key. I can read it back at 160,000/second. Great.

Then I put in a million similar records all with keys in the form of X.Y where X in (1..10) and Y in (1..100,000) and I queried for a random record. Performance fell to 26,000 queries per second. This is still well above the number of queries we need to support (about 1,500/sec)

Finally I put ten million records in from 1.1 up through 10.1000000 and randomly queried for one of the 10 million records. Performance is abysmal at 60 queries per second and my disk is thrashing around like crazy.

I also verified that if I ask for a subset of the data, say the 1,000 records between 3,000,000 and 3,001,000, it returns slowly at first and then as they cache, it speeds right up to 20,000 queries per second and my disk stops going crazy.

I've read all over that people are storing billions of records in Cassandra and fetching them at 5-6k per second, but I can't get anywhere near that with only 10mil records. Any idea what I'm doing wrong? Is there some setting I need to change from the defaults? I'm on an overclocked Core i7 box with 6gigs of ram so I don't think it's the machine.

Here's my code to fetch records which I'm spawning into 8 threads to ask for one value from one column via row key:

ColumnPath cp = new ColumnPath(); cp.Column_family = "Standard1"; cp.Column = utf8Encoding.GetBytes("site"); string key = (1+sRand.Next(9)) + "." + (1+sRand.Next(1000000)); ColumnOrSuperColumn logline = client.get("Keyspace1", key, cp, ConsistencyLevel.ONE);

Thanks for any insights

jbellis jbellis · Accepted Answer · 2010-06-17T15:00:14

purely random reads is about worst-case behavior for the caching that your OS (and Cassandra if you set up key or row cache) tries to do.

if you look at contrib/py_stress in the Cassandra source distribution, it has a configurable stdev to perform random reads but with some keys hotter than others. this will be more representative of most real-world workloads.

Cassandra random read speed

4 Answers