I am building an application that processes a very large amount of data (more than 3 million records). I am new to Cassandra and I am using a 5-node Cassandra cluster to store the data. I have two column families:
Table 1: CREATE TABLE keyspace.table1 (
partkey1 text,
partkey2 text,
clusterKey1 text,
attributes text,
PRIMARY KEY ((partkey1, partkey2), clusterKey1)
) WITH bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
Table 2: CREATE TABLE keyspace.table2 (
partkey1 text,
partkey2 text,
clusterKey2 text,
attributes text,
PRIMARY KEY ((partkey1, partkey2), clusterKey2)
) WITH bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
Note: clusterKey1 and clusterKey2 are randomly generated UUIDs.
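For reference, each row is written with a freshly generated random UUID as its clustering key, roughly like this (a minimal sketch assuming the DataStax Java driver; the Table1Writer name and its method are only illustrative, not my actual code):

import java.util.UUID;

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class Table1Writer {
    private final Session session;
    private final PreparedStatement insertStmt;

    public Table1Writer(Session session) {
        this.session = session;
        // Prepared once and reused for every insert; the clustering key column is text.
        this.insertStmt = session.prepare(
                "INSERT INTO table1 (partkey1, partkey2, clusterKey1, attributes) VALUES (?, ?, ?, ?)");
    }

    public void insert(String partkey1, String partkey2, String attributes) {
        // A new random UUID per insert, so every row within a partition is unique.
        String clusterKey1 = UUID.randomUUID().toString();
        session.execute(insertStmt.bind(partkey1, partkey2, clusterKey1, attributes));
    }
}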
Here is my concern. Per nodetool cfstats, I am getting good throughput on table1, with these stats:
- SSTable count: 2
- Space used (total): 365189326
- Space used by snapshots (total): 435017220
- SSTable Compression Ratio: 0.2578485727722293
- Memtable cell count: 18590
- Memtable data size: 3552535
- Memtable switch count: 171
- Local read count: 0
- Local read latency: NaN ms
- Local write count: 2683167
- Local write latency: 1.969 ms
- Pending flushes: 0
- Bloom filter false positives: 0
- Bloom filter false ratio: 0.00000
- Bloom filter space used: 352
whereas for table2 I am getting very poor read performance, with these stats:
- SSTable count: 33
- Space used (live): 212702420
- Space used (total): 212702420
- Space used by snapshots (total): 262252347
- SSTable Compression Ratio: 0.1686948750752438
- Memtable cell count: 40240
- Memtable data size: 24047027
- Memtable switch count: 89
- Local read count: 24027
- Local read latency: 0.580 ms
- Local write count: 1075147
- Local write latency: 0.046 ms
- Pending flushes: 0
- Bloom filter false positives: 0
- Bloom filter false ratio: 0.00000
- Bloom filter space used: 688
I was wondering why table2 ends up with 33 SSTables and why its read performance is so poor. Can anyone help me figure out what I am doing wrong here?
This is how I query the table:
// selectStamt is a cached field; the statement is prepared lazily on first use.
BoundStatement selectStamt;

if (selectStamt == null) {
    PreparedStatement prprdStmnt = session
            .prepare("select * from table2 where clusterKey2 = ? and partkey1 = ? and partkey2 = ?");
    selectStamt = new BoundStatement(prprdStmnt);
}
// The single BoundStatement instance is shared between threads, so access is synchronized on it.
synchronized (selectStamt) {
    res = session.execute(selectStamt.bind("clusterKey", "partkey1", "partkey2"));
}
In another thread, I am performing update operations on this table, on different data, in the same way.
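The update path follows the same pattern, roughly like this (only a sketch; the exact statement text and the placeholder bind values are illustrative, not my real code):

// updateStamt is a cached field, just like selectStamt above.
BoundStatement updateStamt;

if (updateStamt == null) {
    PreparedStatement prprdUpdate = session
            .prepare("update table2 set attributes = ? where partkey1 = ? and partkey2 = ? and clusterKey2 = ?");
    updateStamt = new BoundStatement(prprdUpdate);
}
synchronized (updateStamt) {
    session.execute(updateStamt.bind("attributes", "partkey1", "partkey2", "clusterKey2"));
}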
As for throughput, I am measuring the number of records processed per second, and it is processing only 50-80 records/sec.
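That figure is computed roughly like this (a minimal sketch; ThroughputMeter is just an illustrative name for what my measuring code does):

import java.util.concurrent.atomic.AtomicLong;

public class ThroughputMeter {
    private final AtomicLong processed = new AtomicLong();
    private final long startNanos = System.nanoTime();

    // Each worker thread calls this after it finishes one record.
    public void recordProcessed() {
        processed.incrementAndGet();
    }

    // Records per second since the meter was created; currently this comes out at 50-80.
    public double recordsPerSecond() {
        double elapsedSec = (System.nanoTime() - startNanos) / 1_000_000_000.0;
        return processed.get() / elapsedSec;
    }
}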