This is taken from the example in http://www.ibm.com/developerworks/library/os-apache-cassandra/. Suppose we are concerned with two entities: books and tags. A book has multiple tags, so the relationship between them is 1:M.
According to the article, we should create two column families: Books and Tags2BooksIndex. The former stores all the info about a book (including all its tags), while the latter is an index that maps from tags to books, so that for a given tag we can quickly find all the books having that tag. All of this looks fine, but I have a question:
Consider how to add a new book to the db: (1) append a new row to column family Books, (2) update Tags2BooksIndex to add the new book to all the tag rows associated with this book.
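To make the two steps concrete, here is a rough sketch of the write path as I understand it. It uses plain Python dicts as stand-ins for the two column families, not a real Cassandra client, and the names add_book and books_with_tag are made up for illustration.

```python
from collections import defaultdict

# In-memory stand-ins for the two column families (assumption: this only
# models the data layout, not Cassandra's replication or API).
books = {}                            # book_id -> {"title": ..., "tags": [...]}
tags2books_index = defaultdict(set)   # tag -> set of book_ids


def add_book(book_id, title, tags):
    """Hypothetical helper showing the two separate writes."""
    # Step (1): write the book row, including all of its tags.
    books[book_id] = {"title": title, "tags": list(tags)}
    # Step (2): add the book to every tag row in the index.
    for tag in tags:
        tags2books_index[tag].add(book_id)


def books_with_tag(tag):
    """Look up all book ids for a tag through the index."""
    return sorted(tags2books_index.get(tag, set()))


add_book("b1", "Cassandra Basics", ["nosql", "database"])
print(books_with_tag("nosql"))
```

The point is that steps (1) and (2) are independent writes, so nothing ties their completion together.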
Let's say that 2 seconds after we complete step (1), the new book row has been replicated to all the nodes it is supposed to go to, while step (2) is still ongoing. Now if I read this new book row from Books to get a tag, and then use this tag to check Tags2BooksIndex, it may happen that I cannot find the new book in Tags2BooksIndex, since either the index has not been completely updated yet, or the update has not been replicated to all replica nodes yet.
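The window I am worried about can be reproduced with a toy model. This is only a single-process simulation with dicts standing in for the column families; the function names are hypothetical, and real replication delays are not modeled, only the ordering of the two writes.

```python
# Dict stand-ins for the two column families (assumption: no real cluster).
books = {}
tags2books_index = {}


def write_book_row(book_id, tags):
    # Step (1): the book row becomes visible.
    books[book_id] = {"tags": list(tags)}


def update_index(book_id, tags):
    # Step (2): the index rows are updated afterwards.
    for tag in tags:
        tags2books_index.setdefault(tag, set()).add(book_id)


def lookup_via_tag(book_id):
    # Read a tag from the book row, then check the index for that tag.
    tag = books[book_id]["tags"][0]
    return book_id in tags2books_index.get(tag, set())


write_book_row("b1", ["nosql"])
found_during_window = lookup_via_tag("b1")   # index not updated yet
update_index("b1", ["nosql"])
found_after = lookup_via_tag("b1")           # index has caught up

print(found_during_window, found_after)  # prints: False True
```

Between the two writes, the book row exists but the index lookup misses it.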
How should such a situation be handled? Even if we replace 2 seconds with 2 milliseconds, we still have a time window of inconsistency. I would like to know the "right/practical" way to handle such a situation.