cassandra: inconsistent column families

Question

This is taken from the example in http://www.ibm.com/developerworks/library/os-apache-cassandra/. Suppose we are concerned with two entities: books and tags. A book has multiple tags, so the relationship between them is 1:M.

According to the article, we should create two column families: Books and Tags2BooksIndex. The former stores all the info about a book (including all its tags), while the latter is an index that maps tags to books, so that for a given tag we can quickly find all the books having that tag. All of this looks fine, but I have a question:

Consider how to add a new book to the db: (1) insert a new row into the Books column family, (2) update Tags2BooksIndex, adding the new book to every tag row associated with it.

Let's say that 2 seconds after we complete step (1), the new book row has been replicated to all the nodes it is supposed to reach, while step (2) is still ongoing. Now if I read this new book row from Books to get a tag, and then use this tag to query Tags2BooksIndex, I may not find the new book in Tags2BooksIndex, either because the index has not been completely updated yet or because the update has not been replicated to all replica nodes yet.
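To make the window concrete, here is a minimal in-memory sketch. It does not talk to Cassandra at all: the two dicts only stand in for the Books and Tags2BooksIndex column families, and the function names (`write_book_row`, `update_tag_index`, `books_for_tag`) and the sample book are made up for illustration.

```python
# In-memory stand-ins for the two column families (illustration only, not Cassandra).
books = {}              # book_id -> {"title": ..., "tags": [...]}
tags2books_index = {}   # tag -> set of book_ids


def write_book_row(book_id, title, tags):
    """Step (1): store the book row, including its tags."""
    books[book_id] = {"title": title, "tags": list(tags)}


def update_tag_index(book_id):
    """Step (2): add the book to every tag row it belongs to."""
    for tag in books[book_id]["tags"]:
        tags2books_index.setdefault(tag, set()).add(book_id)


def books_for_tag(tag):
    """Index lookup: all book ids currently recorded under a tag."""
    return set(tags2books_index.get(tag, set()))


write_book_row("b1", "Cassandra Basics", ["nosql", "databases"])
in_window = books_for_tag("nosql")    # step (2) has not run yet
update_tag_index("b1")
after_index = books_for_tag("nosql")  # index has caught up

print(in_window)    # set()
print(after_index)  # {'b1'}
```

Between the two calls the book row is readable but the index lookup comes back empty, which is exactly the situation I am asking about.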

How should such a situation be handled? Even if we replace 2 seconds with 2 milliseconds, we still have a time window of inconsistency. I would like to know the "right/practical" way to handle it.

ashic · Accepted Answer · 2014-10-22T15:00:31

Cassandra falls on the AP side of CAP: it trades strict consistency for availability. There are features that can help, such as batch statements in Cassandra 2.x: http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/batch_r.html

The real question, though, is what the consequences of such an inconsistency would be. Is it a 2-minute window during which your search won't return a new book for a tag? Is that disastrous? In a fault-tolerant distributed system, you often have to either accept pockets of inconsistency or sacrifice availability, because partitions can and will occur. If your data model really does need the two separate mutations to be applied atomically, then batch statements can help, but they take away a bit of availability. If you're OK with a bit of inconsistency, then you remain available. It comes down to your specific business requirements as to what is and is not acceptable.
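As a rough sketch of the batch option, the snippet below only builds the CQL text for a single logged batch covering both writes. The table and column names (`books`, `tags2books_index`, `book_id`, `title`, `tags`, `tag`) and the `build_add_book_batch` helper are assumptions for illustration, not taken from the article, and the `%s` placeholders assume a driver that binds positional parameters that way.

```python
def build_add_book_batch(book_id, title, tags):
    """Return (cql, params) for one logged batch covering both tables.

    Table and column names are hypothetical; adapt them to your schema.
    """
    unique_tags = sorted(set(tags))
    statements = ["INSERT INTO books (book_id, title, tags) VALUES (%s, %s, %s);"]
    params = [book_id, title, set(unique_tags)]
    # One index insert per tag, so each tag row learns about the new book.
    for tag in unique_tags:
        statements.append(
            "INSERT INTO tags2books_index (tag, book_id) VALUES (%s, %s);"
        )
        params.extend([tag, book_id])
    body = "\n".join("  " + s for s in statements)
    return "BEGIN BATCH\n" + body + "\nAPPLY BATCH;", params


cql, params = build_add_book_batch("b1", "Cassandra Basics", ["nosql", "databases"])
print(cql)
```

Keep in mind that a logged batch ensures all of its statements are eventually applied, but it does not isolate them across partitions, so a reader can still briefly see the book row without the index entry.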
