Cassandra performance slow with secondary indexes

Question

We have test code that uses a Java client to run Cassandra INSERT/READ/QUERY operations against a test schema. We have built a single-node setup on a physical server with the following configuration.

OS is Linux SuSE 11.SP2
Memory on physical server is 32GB
Swap memory is 32GB
CPU has 4 cores at 2GHz each
Commit log resides on a 100GB SSD (RAID-0, local to the system)
Data files reside on 7TB of SAS disk (5 SAS disks configured as RAID-0, local to the system)
JRE version 1.7.0.25
Cassandra version 1.2.5 (default partitioner)
MAX_HEAP_SIZE 8GB
HEAP_NEWSIZE 400MB (100MB per core, as per the Cassandra recommendation)

NOTE: Increasing the CPU from 4 cores to 8 cores improved performance, but only slightly.

We are using the test schema below, which has 5 secondary indexes.

"CREATE TABLE test_table (
  hash_key text PRIMARY KEY,
  ctime timestamp,
  ctime_bucket bigint,
  extension text,
  filename text,
  filename_frag text,
  filesize bigint,
  filesize_bucket bigint,
  hostname text,
  mtime timestamp,
  mtime_bucket bigint
) WITH
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'SnappyCompressor'};

CREATE INDEX test_table_ctime_bucket_idx ON test_table (ctime_bucket);
CREATE INDEX test_table_extension_idx ON test_table (extension);
CREATE INDEX test_table_filename_frag_idx ON test_table (filename_frag);
CREATE INDEX test_table_filesize_bucket_idx ON test_table (filesize_bucket);
CREATE INDEX test_table_mtime_bucket_idx ON test_table (mtime_bucket);"

We are running the following INSERT and READ tests with default tuning parameters, but both read and write performance are very slow, and reads are drastically slower than writes. When we removed the secondary indexes from the schema above, performance improved by around 2x, though we still feel there is scope to improve it by tuning Cassandra parameters. With the secondary indexes in place, performance is very bad.

1M INSERT provides around 7K Ops/sec
10M INSERT provides around 5K Ops/sec (performance drops slightly)
100M INSERT provides around 5K Ops/sec
1000M INSERT provides around 4.5K Ops/sec

If we remove the secondary indexes we get performance around 11K Ops/sec for all workloads listed above.

1M READ provides around 4.5K Ops/sec
10M READ provides only around 225 Ops/sec (performance drops drastically)

We would like to know which specific tuning parameters to apply for WRITE and READ operations to get better performance. How can we defer compaction and GC so that they do not become a bottleneck during these operations? If there are any system-specific tunings we should apply, we would like to know about those as well.

We have tried the following tuning parameters (in cassandra.yaml and cassandra-env.sh), but we have not seen much improvement in write or read performance.

jbellis · Accepted Answer · 2013-08-27T19:03:18

This is a pretty textbook case of being I/O bound, especially with performance going down with larger datasets. iostat can confirm this.
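
As a rough sketch of that check, assuming the sysstat package is installed on the node, something like the following could be run while the 10M READ test is in progress. The interval and report count are arbitrary choices, not values from the question.

```shell
# Extended per-device statistics: one report every 5 seconds, 3 reports total.
# The first report shows averages since boot, so look at the later ones.
if command -v iostat >/dev/null 2>&1; then
  iostat -dx 5 3
else
  echo "iostat not found; install the sysstat package" >&2
fi
```

If the device backing the data directory shows utilization close to 100% and high wait times while the CPUs are mostly idle, that is consistent with the workload being I/O bound.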

You need to switch to SSDs, add machines to your cluster, or reduce the randomness of your workload (increasing caching effectiveness).

Edit: I note that you have the commitlog on SSD. The commitlog is purely sequential writes and thus does not benefit from being on SSD very much. Put the commitlog on one of your HDDs and the data files on SSD instead.
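
A minimal sketch of that swap in cassandra.yaml might look like the fragment below. The mount points are hypothetical and would need to match the actual system. Note also that the SSD in the question is 100GB, so this only works if the data files fit on it.

```yaml
# cassandra.yaml (fragment). Both paths are hypothetical mount points.

# Data files (SSTables) see random reads, which benefit most from SSD.
data_file_directories:
    - /mnt/ssd/cassandra/data

# The commitlog is written sequentially, so a spinning disk is adequate.
commitlog_directory: /mnt/hdd/cassandra/commitlog
```

Cassandra needs to be stopped and the existing files moved to the new locations before it is restarted with the changed settings.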
