
I am loading initial data (a URL list for a crawler) into Cassandra with the status crawled=0. Then, using Hadoop, I crawl all the links and try to change crawled from 0 to something else, for example 1, 2, or 3. When I check in the Cassandra CLI with get ColumnFamily['www.somedomain.com'], the value of the crawled column remains the same. If I leave out the crawled column during the initial import, the job adds it correctly. This is only one part of the algorithm, and I need further updates of this column from other Map/Reduce jobs.

The Thrift and Cassandra API documentation says there are only inserts and deletions, so an insert should work as an update.

The crawled column uses the UTF8 type.

The method that builds the mutation looks like this:

  private static Mutation getMutationCrawled(Text crawledVal)
  {
      Text column = new Text();
      column.set("crawled");

      Column c = new Column();

      c.setName(ByteBuffer.wrap(Arrays.copyOf(column.getBytes(), column.getLength())));
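      // Text.getBytes() returns the backing array, which is only valid up to getLength()
      c.setValue(ByteBuffer.wrap(Arrays.copyOf(crawledVal.getBytes(), crawledVal.getLength())));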
      c.setTimestamp(System.currentTimeMillis());

      Mutation m = new Mutation();
      m.setColumn_or_supercolumn(new ColumnOrSuperColumn());
      m.column_or_supercolumn.setColumn(c);

      return m;
  }

1 Answer


Cassandra resolves conflicts using the timestamp of the mutation, with the largest timestamp winning. You can set the timestamp to whatever value you want, but the convention is to use microseconds since the epoch. In the example above, you set the timestamp with:

 c.setTimestamp(System.currentTimeMillis());

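Most likely the initial import code set the timestamps in microseconds. A microsecond timestamp is roughly 1000 times larger than a millisecond timestamp taken at the same moment, so your updates carry the smaller timestamp and are ignored. Setting the timestamp in microseconds in your mutation (for example, System.currentTimeMillis() * 1000) should let the updates take effect.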
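
To make the comparison concrete, here is a minimal sketch of the timestamp rule described above. It does not use the Cassandra or Thrift classes at all. The class name TimestampDemo, the helper names, and the fixed example times are all made up for illustration, and it assumes the initial import really did write microsecond timestamps.

```java
import java.util.concurrent.TimeUnit;

public class TimestampDemo {

    // Convert epoch milliseconds to epoch microseconds (hypothetical helper).
    static long toMicros(long epochMillis) {
        return TimeUnit.MILLISECONDS.toMicros(epochMillis);
    }

    // Simplified model of "largest timestamp wins": the incoming write
    // replaces the stored one only if its timestamp is strictly larger.
    // Equal timestamps are not modeled here.
    static boolean incomingWins(long storedTimestamp, long incomingTimestamp) {
        return incomingTimestamp > storedTimestamp;
    }

    public static void main(String[] args) {
        // Fixed example times so the result does not depend on the clock.
        long importMillis = 1_300_000_000_000L;        // time of the initial import
        long updateMillis = importMillis + 60_000L;    // update one minute later

        // Assumption: the import stored its timestamp in microseconds.
        long storedTimestamp = toMicros(importMillis);

        // Update stamped in milliseconds: smaller number, so it loses.
        System.out.println(incomingWins(storedTimestamp, updateMillis));           // false

        // Same update stamped in microseconds: larger number, so it wins.
        System.out.println(incomingWins(storedTimestamp, toMicros(updateMillis))); // true
    }
}
```

In getMutationCrawled, the corresponding change would be c.setTimestamp(TimeUnit.MILLISECONDS.toMicros(System.currentTimeMillis())), provided the import did in fact use microseconds. Whichever unit you pick, every writer to the column family has to use the same one.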