Folks,

We are evaluating Cassandra for one of our production applications, and we have a few basic questions we would like answered before going forward.

WRITE :

Cassandra uses consistent hashing to distribute keys evenly across nodes, so any given key is stored on a particular Cassandra node.

We further understand that an internal SSTable structure is created to store this data within the node.

READ :

When performing a read, the client sends the request to any node in the Cassandra cluster, and Cassandra uses consistent hashing to determine which node holds the key.
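To make the lookup concrete, here is a minimal sketch of a consistent-hash ring. It is a toy illustration only: the node names, the `Ring` class, the `token` helper, and the use of SHA-256 are all assumptions made for the example, not Cassandra's actual partitioner.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a 64-bit position on the ring (hash choice is arbitrary here)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class Ring:
    """Toy consistent-hash ring; each node owns several tokens (virtual nodes)."""

    def __init__(self, nodes, vnodes=8):
        self._entries = sorted(
            (token(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._tokens = [t for t, _ in self._entries]

    def owner(self, key: str) -> str:
        # The owner is the node holding the first token at or after the key's
        # token, wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_left(self._tokens, token(key)) % len(self._tokens)
        return self._entries[idx][1]


# Hypothetical node names, for illustration only.
ring = Ring(["node-a", "node-b", "node-c"])
print(ring.owner("user:42"))
```

Whichever node receives the request can run the same computation, so every node arrives at the same owner for a given key.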

The following things are not clear:

1) How many SSTables are created for a given keyspace/column family on a node? (Is it a fixed number, or only one?)

2) The Cassandra documentation describes a Bloom filter (an alternative to standard hashing) that is used to determine whether a given key is present in an SSTable. What if there are 1000 SSTables? Will 1000 Bloom filters be checked to determine whether the key is present?

1 Answer

1) The number of SSTables depends on the compaction strategy and the load. To get an idea, read up on log-structured merge trees for a basic understanding, then look at the different compaction strategies (size-tiered, leveled, date-tiered).

2) Yes, there is one Bloom filter per SSTable, which gives a probabilistic answer as to whether a partition exists in that SSTable: it can return false positives but never false negatives. The size of a Bloom filter depends on the number of partitions and the target false-positive rate. The filters are kept off-heap and are generally small, so they are less of a concern nowadays than in earlier versions.
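To give a feel for the sizes involved, here is a minimal sketch using the standard textbook Bloom filter formulas, plus a toy read that consults one filter per SSTable. The `BloomFilter` class, the helper names, and the SHA-256 based hashing are assumptions for illustration; Cassandra's own implementation differs in its details.

```python
import hashlib
import math


def bloom_parameters(n_items, fp_rate):
    """Textbook sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes."""
    n_bits = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    n_hashes = max(1, round(n_bits / n_items * math.log(2)))
    return n_bits, n_hashes


class BloomFilter:
    def __init__(self, n_items, fp_rate):
        self.n_bits, self.n_hashes = bloom_parameters(n_items, fp_rate)
        self.bits = bytearray((self.n_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive k bit positions from two base hashes.
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.n_bits for i in range(self.n_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


def build_sstable(rows, fp_rate=0.01):
    """Toy SSTable: a Bloom filter paired with the rows it summarizes."""
    bloom = BloomFilter(max(1, len(rows)), fp_rate)
    for key in rows:
        bloom.add(key)
    return bloom, dict(rows)


def read(sstables, key):
    """Check every filter, but only look inside SSTables whose filter says 'maybe'."""
    value, searched = None, 0
    for bloom, rows in sstables:
        if not bloom.might_contain(key):
            continue  # definitely not in this SSTable, skip it
        searched += 1
        if key in rows:
            value = rows[key]
    return value, searched


n_bits, n_hashes = bloom_parameters(1_000_000, 0.01)
print(n_bits / 8 / 1024 / 1024, "MiB,", n_hashes, "hashes")

sstables = [build_sstable({f"k{t}-{i}": t for i in range(100)}) for t in range(50)]
print(read(sstables, "k7-3"))
```

Under these formulas a filter for one million partitions at a 1% false-positive rate needs a little over 1 MiB, and each check is a handful of in-memory bit tests. So with many SSTables every filter is indeed consulted, but that is cheap compared with the disk reads it avoids.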

Reading the Dynamo and Bigtable papers may help in understanding the principles behind the clustering and storage. There are a lot of free resources on the read/write path, and it is too much to cover fully in a Stack Overflow answer, so I would recommend going through some material at DataStax Academy or some presentations on YouTube.