We are in the process of researching a move to Cassandra (2.0.10) and are testing write and read performance.
On reads we are seeing what looks like low throughput: about 14 MB/s on average.
Our current testing environment is a single node: Xeon E5-1620 @ 3.7 GHz, 32 GB of RAM, Windows 7. The Cassandra heap is set to 8 GB with the default concurrent reads and writes, and the key cache size is set to 400 MB. The data sits on a local RAID 10 array that sustains an average of 300 MB/s of sequential reads at 64 KB and larger block sizes.
We are storing hourly sensor data with the following model:
CREATE TABLE IF NOT EXISTS sensor_data_by_day (
    sensor_id   int,
    date        text,
    event_time  timestamp,
    load        float,
    PRIMARY KEY ((sensor_id, date), event_time)
);
Reads are by sensor, date, and a range of event times.
The current data set is two years' worth of data for 100K sensors, about 30 GB on disk.
Data is inserted by numerous threads (so the inserts are not sorted by event time, if that matters).
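For context, each write is a single-row insert along these lines (the literal values here are made up):

INSERT INTO sensor_data_by_day (sensor_id, date, event_time, load)
VALUES (17, '2014-02-02', '2014-02-02 13:00:00', 0.82);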
Reading back a day's worth of data takes about 2 minutes, at a throughput of 14 MB/s. Reads go through the java-cassandra-connector using a prepared statement:

SELECT event_time, load FROM sensor_data_by_day WHERE sensor_id = ? AND date IN ('2014-02-02') AND event_time >= ? AND event_time < ?
We create one connection and submit tasks (100K queries, one per sensor) to an executor service with a pool of 100 threads. Reading when the data is already in the cache takes about 7 s.
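The read loop looks roughly like this (a sketch, not our exact code; the contact point, keyspace name, and date handling are placeholders):

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class ReadBenchmark {
    public static void main(String[] args) throws Exception {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        final Session session = cluster.connect("sensors"); // keyspace name is a placeholder

        final PreparedStatement ps = session.prepare(
            "SELECT event_time, load FROM sensor_data_by_day "
            + "WHERE sensor_id = ? AND date IN ('2014-02-02') "
            + "AND event_time >= ? AND event_time < ?");

        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
        final Date from = fmt.parse("2014-02-02");
        final Date to = fmt.parse("2014-02-03");

        // 100K queries, one per sensor, on a pool of 100 threads.
        ExecutorService pool = Executors.newFixedThreadPool(100);
        for (int i = 0; i < 100000; i++) {
            final int sensorId = i;
            pool.submit(new Runnable() {
                public void run() {
                    ResultSet rs = session.execute(ps.bind(sensorId, from, to));
                    for (Row row : rs) {
                        // Consume the row; real code hands it off for processing.
                        row.getDate("event_time");
                        row.getFloat("load");
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        cluster.close();
    }
}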
It's probably not a client problem: we reran the test with the data on an SSD and the total time went down from 2 minutes to 10 s (~170 MB/s), which is understandably better given it's an SSD.
The read performance looks like a block-read-size issue, which could explain the low throughput if Cassandra were reading in 4 KB blocks. I read that the default is 256 KB, but I couldn't find the setting anywhere to confirm it. Or is this perhaps a random I/O issue?
Is this the kind of read performance you should expect from Cassandra on mechanical disks, or is it perhaps a modeling problem?
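If the read block size is the culprit, my understanding is that the per-table compression options are the knob to look at (assuming chunk_length_kb is the relevant setting; this is what I plan to try, not something I have verified):

-- In cqlsh: show the current compression options (chunk_length_kb defaults to 64 as far as I know).
DESCRIBE TABLE sensor_data_by_day;

-- Try a larger chunk size; existing SSTables need a rewrite for it to take effect,
-- e.g. via: nodetool upgradesstables -a <keyspace> sensor_data_by_day
ALTER TABLE sensor_data_by_day
  WITH compression = {'sstable_compression': 'LZ4Compressor', 'chunk_length_kb': '256'};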
Output of cfhistograms:
SSTables per Read
1 sstables: 844726
2 sstables: 90
Write Latency (microseconds)
No Data
Read Latency (microseconds)
5 us: 418
6 us: 15252
7 us: 12884
8 us: 15447
10 us: 34211
12 us: 48972
14 us: 48421
17 us: 56641
20 us: 12484
24 us: 8325
29 us: 6602
35 us: 4953
42 us: 5427
50 us: 3610
60 us: 1784
72 us: 2414
86 us: 11208
103 us: 38395
124 us: 82050
149 us: 64840
179 us: 40161
215 us: 30891
258 us: 17691
310 us: 8787
372 us: 4171
446 us: 2305
535 us: 1588
642 us: 1187
770 us: 913
924 us: 811
1109 us: 716
1331 us: 602
1597 us: 513
1916 us: 513
2299 us: 516
2759 us: 595
3311 us: 776
3973 us: 1086
4768 us: 1502
5722 us: 2212
6866 us: 3264
8239 us: 4852
9887 us: 7586
11864 us: 11429
14237 us: 17236
17084 us: 22285
20501 us: 26163
24601 us: 26799
29521 us: 24311
35425 us: 22101
42510 us: 19420
51012 us: 16497
61214 us: 13830
73457 us: 11356
88148 us: 8749
105778 us: 6243
126934 us: 4406
152321 us: 2751
182785 us: 1754
219342 us: 977
263210 us: 497
315852 us: 233
379022 us: 109
454826 us: 60
545791 us: 21
654949 us: 10
785939 us: 2
943127 us: 0
1131752 us: 1
Partition Size (bytes)
179 bytes: 151874
215 bytes: 0
258 bytes: 0
310 bytes: 0
372 bytes: 5071
446 bytes: 0
535 bytes: 4170
642 bytes: 3724
770 bytes: 3454
924 bytes: 3416
1109 bytes: 3489
1331 bytes: 9179
1597 bytes: 11616
1916 bytes: 12435
2299 bytes: 19038
2759 bytes: 20653
3311 bytes: 10245454
3973 bytes: 25121333
Cell Count per Partition
4 cells: 151874
5 cells: 0
6 cells: 0
7 cells: 0
8 cells: 5071
10 cells: 0
12 cells: 4170
14 cells: 0
17 cells: 3724
20 cells: 3454
24 cells: 3416
29 cells: 3489
35 cells: 3870
42 cells: 9982
50 cells: 13521
60 cells: 20108
72 cells: 16678
86 cells: 51646
103 cells: 35323903
Comments:

The IN operator really isn't optimized for performance. You would probably do better with date = instead of date IN. – Aaron

Try TRACING for your query as well (datastax.com/documentation/cql/3.1/cql/cql_reference/…). – Mikhail Stepura
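Following up on the comments, the suggested change would look like this in cqlsh (same predicates, equality instead of IN; sensor_id 17 and the timestamps are made-up example values):

-- Enable tracing first to see where the time goes on a single query.
TRACING ON;

SELECT event_time, load FROM sensor_data_by_day
 WHERE sensor_id = 17 AND date = '2014-02-02'
   AND event_time >= '2014-02-02 00:00:00' AND event_time < '2014-02-03 00:00:00';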