4
votes

I encountered a consistency problem using Hector and Cassandra when using QUORUM for both reads and writes.

I use MultigetSubSliceQuery to query rows from a super column with a limit of 100, read them, then delete them, and then start another round.
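
One round of that loop looks roughly like the sketch below. This is a simplified sketch, not my exact code: the column family ("QueueCF"), super column ("pending"), and row keys are placeholders, and keyspace is assumed to be an already-connected Hector Keyspace.

    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.beans.HColumn;
    import me.prettyprint.hector.api.beans.Row;
    import me.prettyprint.hector.api.beans.Rows;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;
    import me.prettyprint.hector.api.query.MultigetSubSliceQuery;
    import me.prettyprint.hector.api.query.QueryResult;

    // Read a slice of at most 100 sub-columns from a super column across a few rows.
    MultigetSubSliceQuery<String, String, String, String> query =
            HFactory.createMultigetSubSliceQuery(keyspace,
                    StringSerializer.get(),   // row key
                    StringSerializer.get(),   // super column name
                    StringSerializer.get(),   // column name
                    StringSerializer.get());  // column value
    query.setColumnFamily("QueueCF");
    query.setSuperColumn("pending");
    query.setKeys("shard-1", "shard-2");
    query.setRange("", "", false, 100);       // the "limit 100" part

    QueryResult<Rows<String, String, String>> result = query.execute();

    // Process each column that was read, then delete it from the super column.
    Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
    for (Row<String, String, String> row : result.get()) {
        for (HColumn<String, String> col : row.getColumnSlice().getColumns()) {
            // ... handle col.getValue() ...
            mutator.subDelete(row.getKey(), "QueueCF", "pending", col.getName(),
                    StringSerializer.get(), StringSerializer.get());
        }
    }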

I found that a row which should have been deleted by my prior query was still returned by the next query.

Also, in a normal column family, I updated the value of one column from status='FALSE' to status='TRUE', and the next time I queried it, the status was still 'FALSE'.
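
The status update itself is just a plain single-column write, along these lines (again a simplified sketch with placeholder names; Hector imports omitted):

    // Flip status from 'FALSE' to 'TRUE' for one row.
    Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
    mutator.insert("user-42", "StatusCF", HFactory.createStringColumn("status", "TRUE"));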

More detail:

  1. It does not happen every time (roughly 1 in 10,000)
  2. The time between the two queries is around 500 ms (but we found one pair of queries with 2 seconds between them that still showed the inconsistency)
  3. We use NTP as our cluster time synchronization solution.
  4. We have 6 nodes, and the replication factor is 3

I understand that Cassandra is supposed to be "eventually consistent", and that a read may not immediately see a preceding write inside Cassandra. But for two seconds?! And if so, isn't it then meaningless to have QUORUM or other consistency level configurations?

So first of all, is this the correct behavior for Cassandra, and if not, what data do we need to analyze for further investigation?
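
For reference, the consistency level is configured once on the keyspace, roughly like this (a sketch; the cluster address and names are placeholders, and Hector imports are omitted):

    // Ask Hector to use QUORUM for both reads and writes on this keyspace.
    ConfigurableConsistencyLevel ccl = new ConfigurableConsistencyLevel();
    ccl.setDefaultReadConsistencyLevel(HConsistencyLevel.QUORUM);
    ccl.setDefaultWriteConsistencyLevel(HConsistencyLevel.QUORUM);

    Cluster cluster = HFactory.getOrCreateCluster("TestCluster", "127.0.0.1:9160");
    Keyspace keyspace = HFactory.createKeyspace("MyKeyspace", cluster, ccl);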

4
After switching from QUORUM reads/writes to ALL reads/writes, the problem was solved, so it seems Cassandra failed to merge data when it found a "Digest mismatch", but it did not fail for every "Digest mismatch". It is really strange; the Cassandra version is 1.0.3 - Jason T

4 Answers

3
votes

After checking the source code against the system log, I found the root cause of the inconsistency. Three factors cause the problem:

  • The same record is created and updated from different nodes
  • Local system time is not synchronized accurately enough (although we use NTP)
  • Consistency level is QUORUM

Here is the problem; take the following as the event sequence:

 seqID   NodeA         NodeB          NodeC
 1.      New(.050)     New(.050)      New(.050)
 2.      Delete(.030)  Delete(.030)

The Create request comes first, from Node C, with local timestamp 00:00:00.050; assume the request is recorded first in Node A and Node B, and only later synchronized to Node C.

Then the Delete request comes from Node A, with local timestamp 00:00:00.030, and is recorded in Node A and Node B.

When a read request comes in, Cassandra does a version-conflict merge, but the merge depends only on the timestamp. So although the Delete happened after the Create, the final merged result is "New", which carries the later timestamp due to the local time synchronization issue.
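
To make this concrete, here is a toy model of timestamp-only reconciliation (a simplified illustration, not Cassandra's actual source): the Delete, although issued later in real time, carries the smaller timestamp and therefore loses the merge.

    // Toy model of last-write-wins reconciliation: only the timestamp decides the winner.
    class Version {
        final String value;    // "New", or "TOMBSTONE" standing in for a delete marker
        final long timestamp;  // client-supplied timestamp (microseconds)

        Version(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    class ReconcileDemo {
        // The version with the larger timestamp wins, regardless of real-world order.
        static Version reconcile(Version a, Version b) {
            return a.timestamp >= b.timestamp ? a : b;
        }

        public static void main(String[] args) {
            Version created = new Version("New", 50_000);       // stamped 00:00:00.050 on Node C
            Version deleted = new Version("TOMBSTONE", 30_000);  // issued later, but stamped 00:00:00.030 on Node A
            System.out.println(reconcile(created, deleted).value);  // prints "New" - the Delete is lost
        }
    }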

1
votes

I also faced a similar issue. The issue occurred because the Cassandra driver used the server timestamp by default to decide which write is the latest. However, in the latest versions of the driver this has changed, and client-side timestamps are now the default.

I have described the details of the issue here.
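
For example, with the DataStax Java driver (a sketch assuming driver 3.x, where client-side timestamps became the default; the contact point, keyspace, table, and column names are made up), the behavior can be controlled like this:

    import com.datastax.driver.core.AtomicMonotonicTimestampGenerator;
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    // Generate write timestamps on the client (the 3.x default) so that the order
    // Cassandra sees matches the order in which this client issued its writes.
    Cluster cluster = Cluster.builder()
            .addContactPoint("127.0.0.1")
            .withTimestampGenerator(new AtomicMonotonicTimestampGenerator())
            .build();
    Session session = cluster.connect("my_keyspace");

    // A timestamp can also be forced per statement with CQL's USING TIMESTAMP.
    session.execute("UPDATE status_table USING TIMESTAMP 1405430000000000"
            + " SET status = 'TRUE' WHERE id = 'user-42'");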

0
votes

The deleted rows may be showing up as "range ghosts" because of the way that distributed deletes work: see http://wiki.apache.org/cassandra/FAQ#range_ghosts

If you are reading and writing individual columns, both at CL_QUORUM, then you should always get full consistency, regardless of the time interval (provided strict ordering is still observed, i.e. you are certain that the read always happens after the write): with a replication factor of 3, QUORUM is 2 nodes, so the 2 replicas written and the 2 replicas read must overlap in at least one node. If you are not seeing this, then something, somewhere, is wrong.

To start with, I'd suggest a) verifying that the clients are syncing to NTP properly, and/or reproducing the problem with times cross-checked between the clients somehow, and b) maybe trying to reproduce the problem with CL_ALL.

Another thought - are your clients synced with NTP, or just the Cassandra server nodes? Remember that Cassandra uses the client timestamps to determine which value is the most recent.
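
With Hector, for example, the timestamp can be taken explicitly from the client's clock when writing (a sketch; names are illustrative and Hector imports are omitted):

    // The timestamp attached to the write comes from this client's clock, which is
    // why the client machines (not just the Cassandra nodes) need to be NTP-synced.
    Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
    long clock = keyspace.createClock();  // microsecond clock generated client-side
    mutator.insert("user-42", "StatusCF",
            HFactory.createColumn("status", "TRUE", clock,
                    StringSerializer.get(), StringSerializer.get()));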

-1
votes

I'm running into this problem with one of my clients/nodes. The other 2 clients I'm testing with (and 2 other nodes) run smoothly. I have a test that uses QUORUM for all reads and all writes, and it fails very quickly. Actually, some processes do not see anything from the others, and others may still see data even after I delete it at QUORUM.

In my case I turned on the logs and intended to watch them while testing, using the tail -F command:

tail -F /var/lib/cassandra/log/system.log

to see whether I was getting some errors as presented here. To my surprise the tail process itself returned an error:

tail: inotify cannot be used, reverting to polling: Too many open files

and according to another thread this means that some processes will fail to open files. In other words, the Cassandra node is likely not responding as expected because it cannot properly access its data on disk.

I'm not too sure whether this is related to the problem of the user who posted the question, but tail -F is certainly a good way to determine whether the open-file limit was reached.

(FYI, I have 5 relatively heavy servers running on the same machine, so I'm not too surprised. I'll have to look into increasing the ulimit. I'll report here again if I get it fixed this way.)

More info about the file limit and the ulimit command line option: https://askubuntu.com/questions/181215/too-many-open-files-how-to-find-the-culprit

--------- Update 1

Just in case: I first tested using Java 1.7.0-11 from Oracle (and, as mentioned below, I first used a limit of 3,000 without success!). The same error would pop up at about the same time when running my Cassandra test (plus, even with the ulimit of 3,000, the tail -F error would still appear...).

--------- Update 2

Okay! That worked. I changed the ulimit to 32,768 and the problems are gone. Note that I had to enlarge the per-user limit in /etc/security/limits.conf and run sudo sysctl -p before I could bump the maximum to such a high number. Somehow an upper limit of 3,000 was not enough, even though the old limit was only 1,024.
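
For reference, the kind of configuration involved looks roughly like this (the values and the user name are illustrative, adjust for your own setup):

    # /etc/security/limits.conf - raise the per-user open-file limit
    cassandra  soft  nofile  32768
    cassandra  hard  nofile  32768

    # /etc/sysctl.conf - system-wide ceiling, picked up by `sudo sysctl -p`
    fs.file-max = 100000

After logging in again, ulimit -n should report the new per-user value.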