
I recently started working with the Cassandra database. I have installed a single-node cluster on my local box, and I am working with Cassandra 1.2.3.

I was reading an article on the internet and found this passage:

Cassandra writes are first written to a commit log (for durability), and then to an in-memory table structure called a memtable. A write is successful once it is written to the commit log and memory, so there is very minimal disk I/O at the time of write. Writes are batched in memory and periodically written to disk to a persistent table structure called an SSTable (sorted string table).

To understand the passage above, I wrote a simple program that writes to the Cassandra database using the Pelops client, and I was able to insert the data.

Now I am trying to see how my data was written to the commit log and where that commit log file is. I would also like to know how SSTables are generated, where I can find them on my local box, and what they contain.

I wanted to see these two files so that I can understand more how Cassandra works behind the scenes.

In my cassandra.yaml file, I have something like this

# directories where Cassandra should store data on disk.
data_file_directories:
    - S:\Apache Cassandra\apache-cassandra-1.2.3\storage\data

# commit log
commitlog_directory: S:\Apache Cassandra\apache-cassandra-1.2.3\storage\commitlog

# saved caches
saved_caches_directory: S:\Apache Cassandra\apache-cassandra-1.2.3\storage\savedcaches

But when I open the commit log, it contains so much data that Notepad++ cannot open it properly, and when it does open, the content is unreadable, possibly because of the encoding. And in my data folder I cannot find anything.

That is, this folder is empty for me:

S:\Apache Cassandra\apache-cassandra-1.2.3\storage\data\my_keyspace\users

Is there anything I am missing here? Can anybody explain how to read the commit log and SSTable files, and where I can find them? And what exactly happens behind the scenes whenever I write to the Cassandra database?

Updated:-

Code I am using to insert into Cassandra Database-

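import java.util.Date;
import java.util.List;

import org.apache.cassandra.thrift.Column;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.log4j.Logger;
import org.scale7.cassandra.pelops.Cluster;
import org.scale7.cassandra.pelops.Mutator;
import org.scale7.cassandra.pelops.Pelops;
import org.scale7.cassandra.pelops.Selector;

public class MyPelops {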

    private static final Logger log = Logger.getLogger(MyPelops.class);

    public static void main(String[] args) throws Exception {


        // -------------------------------------------------------------
        // -- Nodes, Pool, Keyspace, Column Family ---------------------
        // -------------------------------------------------------------

        // A comma separated List of Nodes
        String NODES = "localhost";

        // Thrift Connection Pool
        String THRIFT_CONNECTION_POOL = "Test Cluster";

        // Keyspace
        String KEYSPACE = "my_keyspace";

        // Column Family
        String COLUMN_FAMILY = "users";

        // -------------------------------------------------------------
        // -- Cluster --------------------------------------------------
        // -------------------------------------------------------------

        Cluster cluster = new Cluster(NODES, 9160);

        Pelops.addPool(THRIFT_CONNECTION_POOL, cluster, KEYSPACE);

        // -------------------------------------------------------------
        // -- Mutator --------------------------------------------------
        // -------------------------------------------------------------

        Mutator mutator = Pelops.createMutator(THRIFT_CONNECTION_POOL);

        log.info("- Write Column -");

        mutator.writeColumn(
                COLUMN_FAMILY,
                "Row1",
                new Column().setName(" Name ".getBytes()).setValue(" Test One ".getBytes()).setTimestamp(new Date().getTime()));

        mutator.writeColumn(
                COLUMN_FAMILY,
                "Row1",
                new Column().setName(" Work ".getBytes()).setValue(" Engineer ".getBytes()).setTimestamp(new Date().getTime()));

        log.info("- Execute -");
        mutator.execute(ConsistencyLevel.ONE);

        // -------------------------------------------------------------
        // -- Selector -------------------------------------------------
        // -------------------------------------------------------------

        Selector selector = Pelops.createSelector(THRIFT_CONNECTION_POOL);

        int columnCount = selector.getColumnCount(COLUMN_FAMILY, "Row1",
                ConsistencyLevel.ONE);
        System.out.println("- Column Count = " + columnCount);

        List<Column> columnList = selector
                .getColumnsFromRow(COLUMN_FAMILY, "Row1",
                        Selector.newColumnsPredicateAll(true, 10),
                        ConsistencyLevel.ONE);
        System.out.println("- Size of Column List = " + columnList.size());

        for (Column column : columnList) {
            System.out.println("- Column: (" + new String(column.getName()) + ","
                    + new String(column.getValue()) + ")");
        }

        System.out.println("- All Done. Exit -");
        System.exit(0);
    }

}

Keyspace and Column family that I have created-

create keyspace my_keyspace with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy' and strategy_options = {replication_factor:1};
use my_keyspace;
create column family users with column_type = 'Standard' and comparator = 'UTF8Type';

1 Answer


You are almost there in your understanding, but you are missing a few small details.

To explain things in a structured way, the life cycle of a Cassandra write is divided into these steps:

  • commitlog write
  • memtable write
  • sstable write

Cassandra writes are first written to a commit log (for durability), and then to an in-memory table structure called a memtable. A write is considered successful once it is written to the commit log and to memory, so there is very little disk I/O at the time of the write. Whenever a flush threshold is reached (the memtable outgrows its configured space, the commit log outgrows its limit, or, in older versions, a time limit expires), the memtable is written to an SSTable, which is immutable. This mechanism is called flushing. The exact thresholds and their defaults depend on the Cassandra version; in 1.2 they are controlled by settings such as memtable_total_space_in_mb and commitlog_total_space_in_mb in cassandra.yaml. Once an SSTable has been written, you can see the corresponding data in the data folder, in your case S:\Apache Cassandra\apache-cassandra-1.2.3\storage\data. Each SSTable consists mainly of two kinds of files, index files and a data file:

  • Index files contain the Bloom filter (Filter.db) and the key-offset pairs (Index.db)

    • Bloom filter: a Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not. Cassandra uses Bloom filters to save I/O when performing a key lookup: each SSTable has a Bloom filter associated with it that Cassandra checks before doing any disk seeks, making queries for keys that don't exist almost free
    • (Key, offset) pairs (points into data file)
  • Data file contains the actual column data
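
To make the Bloom filter behaviour concrete, here is a minimal sketch in Java. It is not Cassandra's implementation: the class name `ToyBloomFilter`, the bit-array size, and the way the hash positions are derived from `String.hashCode()` are all assumptions chosen for illustration. It only shows why a "no" answer is always reliable while a "yes" answer may be a false positive.

```java
import java.util.BitSet;

// Toy Bloom filter for illustration only; Cassandra's real filter uses
// different hashing and sizing. All names here are hypothetical.
final class ToyBloomFilter {
    private final BitSet bits;
    private final int size;      // number of bits in the filter
    private final int hashCount; // number of bit positions set per key

    public ToyBloomFilter(int size, int hashCount) {
        if (size <= 0 || hashCount <= 0) {
            throw new IllegalArgumentException("size and hashCount must be positive");
        }
        this.size = size;
        this.hashCount = hashCount;
        this.bits = new BitSet(size);
    }

    // Derive the i-th bit position from two hash values (double hashing).
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1; // forced odd so successive positions differ
        return Math.floorMod(h1 + i * h2, size);
    }

    // Record a key by setting all of its bit positions.
    public void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(key, i));
        }
    }

    // false: the key was definitely never added (no false negatives).
    // true:  the key was probably added (false positives are possible).
    public boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```

A read would call `mightContain(rowKey)` for each SSTable first and only touch the index and data files of the SSTables that answer `true`.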

As for the commit log files, they are not encrypted, but they are written in a binary format that Cassandra maintains internally. That is why a text editor cannot show you anything readable.

UPDATE:

A memtable is an in-memory cache whose content is stored as key/column pairs, sorted by key. Each column family has its own memtable, and column data is looked up by key. Because memtables live only in memory, you cannot locate them on disk.

In your case the memtable is not full: the memtable thresholds have not been reached yet, so nothing has been flushed and the data folder is still empty. You can read more about MemtableThresholds here, though it is recommended not to touch that dial. If you just want to see the SSTable files, you can force a flush with nodetool flush.
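
The effect described above can be sketched with a small toy model in Java. This is not Cassandra code: the class `ToyWritePath` is hypothetical, and the entry-count threshold is an assumption made to keep the example short (real flushes are driven by memory and commit log size). It shows the three steps of a write and why a couple of small writes leave no SSTable behind.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy model of the write path; every name and the threshold are hypothetical.
final class ToyWritePath {
    private final int flushThreshold;                            // assumed: flush after this many keys
    private final List<String> commitLog = new ArrayList<>();    // stands in for the commit log on disk
    private TreeMap<String, String> memtable = new TreeMap<>();  // in memory, sorted by key
    private final List<SortedMap<String, String>> sstables = new ArrayList<>(); // stands in for data files

    public ToyWritePath(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    public void write(String key, String value) {
        commitLog.add(key + "=" + value); // step 1: append to the commit log
        memtable.put(key, value);         // step 2: update the memtable
        if (memtable.size() >= flushThreshold) {
            flush();                      // step 3: only when the threshold is reached
        }
    }

    // Turn the current memtable into an immutable, sorted table.
    public void flush() {
        if (memtable.isEmpty()) {
            return;
        }
        sstables.add(Collections.unmodifiableSortedMap(memtable));
        memtable = new TreeMap<>();
        commitLog.clear(); // flushed writes no longer need to be replayed
    }

    public int memtableSize() { return memtable.size(); }
    public int sstableCount() { return sstables.size(); }
    public int commitLogSize() { return commitLog.size(); }
}
```

With a threshold of 3, two writes leave `sstableCount()` at 0 while `commitLogSize()` is 2, which mirrors a populated commit log next to an empty data folder. Calling `flush()` by hand plays the role of a forced flush.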

SSTable structure:

  • Your data folder
    • KEYSPACE
      • CF
        • CompressionInfo.db
        • Data.db
        • Filter.db
        • Index.db
        • Statistics.db
        • snapshots //if snapshots are taken
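
To check whether a flush has produced any of these files, a small helper like the following can walk the data folder. It is only a sketch: the class `SSTableLister` is hypothetical, and it simply assumes that SSTable component files end in `.db`, as in the listing above.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: lists every *.db file below a data directory.
final class SSTableLister {
    private SSTableLister() {
    }

    // Returns the sorted paths of all files ending in ".db" under dir.
    public static List<String> listDbFiles(File dir) {
        List<String> result = new ArrayList<>();
        collect(dir, result);
        Collections.sort(result);
        return result;
    }

    private static void collect(File dir, List<String> out) {
        File[] children = dir.listFiles(); // null if dir is missing or not a directory
        if (children == null) {
            return;
        }
        for (File child : children) {
            if (child.isDirectory()) {
                collect(child, out);       // descend into KEYSPACE/CF folders
            } else if (child.getName().endsWith(".db")) {
                out.add(child.getPath());
            }
        }
    }
}
```

For the setup in the question, this would be called as `SSTableLister.listDbFiles(new File("S:\\Apache Cassandra\\apache-cassandra-1.2.3\\storage\\data"))`. An empty list means nothing has been flushed yet.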

For more information, refer to sstable.