0
votes

I need to implement versioning using Cassandra.

This is my column family definition:

create table file_details(id text primary key, fname text, version int, mimetype text);

I have created a secondary index on the fname column.

Whenever I insert a row with an existing fname, the version should be incremented. And when I retrieve a row by fname, it should return the row with the latest version.

Please suggest what approach I should take.

1
Do you have a requirement for the version to increment by exactly 1 each time? If not, the max of the write timestamps for fname and mimetype is an always-increasing number, so it can be used for versioning. - Richard
Yes, I have a requirement to increase the version by exactly 1. Also, can you tell me what the query would be to get the max timestamps for fname and mimetype? - Dawood
You can use select writetime(fname), writetime(mimetype) from file_details where id = 'id'; and find the max in your code. - Richard
Thanks Richard for the quick response. Any idea what needs to be done if I have to increment by exactly 1 each time? - Dawood

1 Answer

2
votes

If it's not possible to relax the requirement of versions increasing by 1, one option is to use counters.

Create a table for the data:

create table file_details(id text primary key, fname text, mimetype text);

and a separate table for the version:

create table file_details_version(id text primary key, version counter);

This needs to be a separate table because, apart from the primary key, a table's columns must either all be counters or none of them can be.

Then for an update you can do:

insert into file_details(id, fname, mimetype) values ('id1', 'fname', 'mime');
update file_details_version set version = version + 1 where id = 'id1';

Then a read from file_details will always return the latest, and you can find the latest version number from file_details_version.
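As a minimal sketch, the two reads could look like the following, assuming the same 'id1' row used in the statements above:

```cql
-- Latest data for the file (each insert overwrites the previous row)
select fname, mimetype from file_details where id = 'id1';

-- Current version number, kept in the separate counter table
select version from file_details_version where id = 'id1';
```

These are two independent queries, so the application has to combine the results itself.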

There are numerous problems with this, though. Counter updates can't be included in an atomic batch with the insert, so the two updates are not atomic: some failure scenarios could lead to only the insert into file_details being persisted. Also, there is no read isolation, so if you read during an update you may get inconsistent data between the two tables. Finally, counter updates in Cassandra are not tolerant of failures: if a failure happens during a counter update and the update is retried, you may double count, i.e. increment the version by too much.

I think all solutions involving counters will hit these issues. You could avoid counters by generating a unique ID (e.g. a large random number) for each update and inserting that into a row in a separate table. The version would then be the number of IDs in the row. Now you can do atomic updates, and the counts would be tolerant of failures. However, the read time would be O(number of updates), and reads would still not be isolated.
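A rough sketch of the unique-ID approach is below. The table name file_details_updates and the column update_id are hypothetical names chosen for illustration, and a client-generated UUID stands in for the "large random number". Generating the ID in the client, rather than in the query, means a retry after a failure rewrites the same row instead of adding a new one.

```cql
-- Hypothetical table: one partition per file id, one clustering row per update
create table file_details_updates(
    id text,
    update_id uuid,
    primary key (id, update_id)
);

-- A logged batch applies both writes atomically (all or nothing)
begin batch
    insert into file_details(id, fname, mimetype) values ('id1', 'fname', 'mime');
    -- update_id is generated by the client and reused unchanged on retry
    insert into file_details_updates(id, update_id)
        values ('id1', 123e4567-e89b-12d3-a456-426655440000);
apply batch;

-- The version is the number of update IDs stored for this file
select count(*) from file_details_updates where id = 'id1';
```

The count query has to read every update row for the file, which is where the O(number of updates) read cost comes from.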