
I am inserting streaming data into 2 separate keyspaces, with data inserted into 2 column families (both standard) in the first keyspace and into 3 column families (2 standard and 1 counter) in the second keyspace.

The data insert rate into these column families is well controlled, and with pure writes everything works just fine [60% CPU utilization and a CPU load factor of about 8-10]. Next, I attempt to continuously read data from these column families via the Pycassa API while the writes are happening in parallel, and I notice a severe degradation in write performance.

What system settings are recommended for parallel writes + reads across 2 keyspaces? Currently the data directory is on a single physical drive with RAID10 on each node.

RAM: 8GB

HeapSize: 4GB

Quad core Intel Xeon Processor @3.00 GHz

Concurrent Writes = Concurrent Reads = 16 (in cassandra.yaml file)

Data Model

Keyspace1: I am inserting time series data with time stamp (T) as the column name in a wide column that stores 24 hours worth of data in a single row.

CF1:

    Col1    |   Col2    |   Col3(DateType)  |   Col4(UUIDType)  |   

RowKey1

RowKey2

:

:

CF2 (Wide column family):

RowKey1 (T1, V1) (T2, V2) (T4, V4) ......

RowKey2 (T1, V1) (T3, V3) .....

:

:
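The wide-row layout above can be sketched in plain Python. This is a hedged illustration, not the original code: `day_row_key` and `sensor42` are hypothetical names, and with pycassa the resulting dict of timestamp-to-value columns would be passed to `ColumnFamily.insert`.

```python
from datetime import datetime, timedelta

def day_row_key(device_id, day):
    # One row per source per day keeps each wide row bounded to 24h of data.
    return "%s:%s" % (device_id, day.strftime("%Y%m%d"))

def build_columns(samples):
    # samples: iterable of (timestamp, value) pairs; the dict maps
    # column name (timestamp T) -> value V, matching the (T, V) pairs above.
    return {ts: value for ts, value in samples}

start = datetime(2013, 4, 1)
samples = [(start + timedelta(minutes=i), float(i)) for i in range(3)]
columns = build_columns(samples)
# With a live cluster: cf2.insert(day_row_key("sensor42", start), columns)
```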

Keyspace2:

CF1:

    Col1    |   Col2    |   Col3(DateType)  |   Col4(UUIDType)  |   ...  Col10

RowKey1

RowKey2

:

:

CF2 (Wide column family):

RowKey1 (T1, V1) (T2, V2) (T4, V4) ......

RowKey2 (T1, V1) (T3, V3) .....

:

:

CF3 (Counter Column family):

Counts the occurrences of every event stored in CF2.

The data is continuously read from CF2 only (the wide column family) in Keyspaces 1 and 2. Just to reiterate, the reads and writes happen in parallel. The amount of data queried increases incrementally from 1 to 8 row keys per multiget call, and then the process repeats.
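A minimal sketch of that ramped read pattern, in plain Python under stated assumptions: the key names and the batching helper are illustrative, and the actual pycassa `multiget` call (commented out) needs a live cluster.

```python
# Hypothetical sketch of the read loop described above: the number of
# row keys fetched per multiget ramps from 1 up to 8, then starts over.
def ramped_batches(row_keys, max_batch=8):
    batches, i, size = [], 0, 1
    while i < len(row_keys):
        batches.append(row_keys[i:i + size])
        i += size
        size = size + 1 if size < max_batch else 1
    return batches

keys = ["row%d" % n for n in range(10)]
batches = ramped_batches(keys)
# With a live cluster, each batch would go to cf2.multiget(batch).
```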

Can you provide details about what your data model is and what the writes and reads (especially reads) are doing? - Tyler Hobbs
Edited the original query with the Data model - vinay sudhakar
Thanks. I missed this at first, but is your commit log on the same RAID as the data? That could explain part of the impact; check disk util. Otherwise, take a look at CPU utilization and GC activity. - Tyler Hobbs
The commit log and data are on different physical drives. Disk utilization and the disk queue look normal. We reduced the buffer size in the pycassa multiget call and it seems to be performing okay. We will stress test for 24-48 hours; hopefully it holds up. CPU utilization is now around 60-70% on all nodes. The multiget call always proves to be inefficient at higher degrees of parallelism. Time to rewrite the client differently! - vinay sudhakar
The number of pending writes and the number of pending compactions increases significantly on one particular node, which subsequently stops accepting writes. Do continuous reads have an impact on compactions in a typical mixed (read+write) workload scenario? I am using Cassandra 1.2.3. - vinay sudhakar
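For reference, pycassa's `multiget` accepts a `buffer_size` argument that caps how many keys are sent per round trip. The effect of lowering it can be sketched with plain chunking logic (the `cf2` column family object is hypothetical):

```python
# Sketch of what a smaller buffer_size does: pycassa splits the key list
# into chunks of at most buffer_size keys per round trip, so a lower
# value trades batch throughput for smaller, gentler requests.
def chunk_keys(keys, buffer_size):
    return [keys[i:i + buffer_size] for i in range(0, len(keys), buffer_size)]

keys = ["row%d" % n for n in range(8)]
chunks = chunk_keys(keys, 2)
# Roughly what cf2.multiget(keys, buffer_size=2) does internally.
```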

1 Answer


Possible ways to overcome the issue:

  1. Increased the space allocated to the young generation, as recommended in this blog post: http://tech.shift.com/post/74311817513/cassandra-tuning-the-jvm-for-read-heavy-workloads

  2. Made small schema updates and dropped unnecessary secondary indexes, which decreased compaction overhead.

  3. Reduced the write timeout to 2s in cassandra.yaml as recommended in my previous post: Severe degradation in Cassandra Write performance with continuous streaming data over time
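For item 1, the young-generation size is set in conf/cassandra-env.sh. A hedged example, assuming the 4GB heap mentioned in the question; the 800M figure is illustrative only, not a value taken from the linked post:

```shell
# conf/cassandra-env.sh -- sizes are illustrative, tune for your workload
MAX_HEAP_SIZE="4G"
# A larger young generation lets short-lived read objects die in cheap
# minor GCs instead of being promoted and triggering long full GCs.
HEAP_NEWSIZE="800M"
```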

The read client still needs an update to avoid using multiget at high workloads, but the changes above have significantly improved performance.