
I have millions of records and need to fetch the most recent ones. I want to save each record with a record_id, and I want this record_id to auto-increment when a new record is inserted.

FOR EXAMPLE: Suppose I have 1000 records and first I want the latest 100, records 901 to 1000. On the second request, I want the next 100 latest records, 801 to 900. I have gone through so many links but have not found anything relevant. Can anybody give a proper solution?

Let's suppose table emp contains:
name text,
record_id int,
address text

Let's assume name is the primary key and record_id is the clustering key. But please don't discuss primary key design now, because my requirement is to create the id as a clustering key (I'm using Cassandra 2.2.3 and CQL 3.3.1).

2 Answers

Auto-increment IDs are not supported in Cassandra. But you can use a time-based UUID (timeuuid) as the clustering key instead, to make sure that records are ordered from oldest to most recent.
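To see why a time-based UUID gives you that ordering: a version-1 UUID embeds a 60-bit timestamp (100-ns intervals since 1582-10-15), and CQL's timeuuid type uses exactly that layout. A small Python sketch (the standard uuid module generates the same version-1 format):

```python
import uuid
from datetime import datetime, timedelta

# Version-1 (time-based) UUIDs embed a 60-bit timestamp counting
# 100-ns intervals since the Gregorian epoch. CQL's timeuuid uses
# the same layout; dateof() performs the conversion below.
GREGORIAN_EPOCH = datetime(1582, 10, 15)

def timeuuid_to_datetime(u):
    """Recover the wall-clock time (UTC) embedded in a version-1 UUID."""
    return GREGORIAN_EPOCH + timedelta(microseconds=u.time // 10)

# uuid1() generates version-1 UUIDs, so their embedded timestamps are
# non-decreasing in generation order -- clustering on a timeuuid
# therefore orders rows from oldest to most recent.
ids = [uuid.uuid1() for _ in range(3)]
assert all(a.time <= b.time for a, b in zip(ids, ids[1:]))

for u in ids:
    print(u, timeuuid_to_datetime(u).isoformat())
```

Note that comparing Python UUID objects directly does not sort them by time (the byte layout puts the low timestamp bits first); compare the `.time` field, which is effectively what Cassandra's timeuuid comparator does.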

Auto-increment IDs don't really work in Cassandra or any other distributed database.

Why? Let's say that you have three nodes. Two nodes get write requests to the same table at the same time. One checks the table for the max ID and gets a response of (for example) 2544. Before that new row can be written, the other node does the same thing and also gets 2544. Now you have two rows being inserted with ID 2545, and since in Cassandra the last write "wins," you'll lose the first write.
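That race is easy to demonstrate. A toy simulation (a plain Python dict stands in for a Cassandra table keyed by record_id, and last-write-wins means a second insert with the same key silently replaces the first):

```python
# Simulate the read-then-increment race under last-write-wins semantics.
table = {2544: "existing row"}

def next_id(table):
    # read-before-write: each client computes max(id) + 1
    return max(table) + 1

# Two clients read concurrently, before either one writes:
id_seen_by_a = next_id(table)   # 2545
id_seen_by_b = next_id(table)   # also 2545

table[id_seen_by_a] = "row from client A"
table[id_seen_by_b] = "row from client B"   # same key -- overwrites A's row

print(len(table))    # 2 rows, not the 3 we intended to have
print(table[2545])   # only client B's row survives
```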

Consequently, this is also why read-before-write approaches are considered anti-patterns in Cassandra. As Stefan suggested, a TimeUUID offers a way around this problem.

In Cassandra you need to design your tables to fit your query patterns. What I'm hearing is that you want to retrieve the last 100 updated employee records. I would create a specific table to serve that:

CREATE TABLE employee_updates (
  datebucket text,
  record_id timeuuid,
  name text,
  address text,
  PRIMARY KEY (datebucket,record_id))
WITH CLUSTERING ORDER BY (record_id DESC);

Now when you query this table for the last 100 records:

SELECT * FROM employee_updates WHERE datebucket='20160309' LIMIT 100;

You can get the most-recent 100 records for that particular day.

Note: If "day" is too granular for your solution (only a few employee records get updated each day) then feel free to widen that to something more applicable.

UPDATE:

What if I want the previous latest 100 records, that is, 801 to 900?

This solution actually does have a way to "page" through the results.

Let's insert some rows into your table:

> INSERT INTO employee_updates (datebucket, record_id , address , name ) VALUES ('20160309',now(),'123 main st.','Bob Kerman');
> INSERT INTO employee_updates (datebucket, record_id , address , name ) VALUES ('20160309',now(),'456 Gene ave.','Bill Kerman');
> INSERT INTO employee_updates (datebucket, record_id , address , name ) VALUES ('20160309',now(),'34534 Water st.','Jebediah Kerman');
> INSERT INTO employee_updates (datebucket, record_id , address , name ) VALUES ('20160309',now(),'843 Rocket dr.','Valentina Kerman');
> INSERT INTO employee_updates (datebucket, record_id , address , name ) VALUES ('20160309',now(),'33476 Booster way','Isabella Kerman');
> INSERT INTO employee_updates (datebucket, record_id , address , name ) VALUES ('20160309',now(),'43 Solid Rocket pl.','Helcine Kerman');

Now let me SELECT the top 3 most-recent for today:

> SELECT datebucket, record_id, dateof(record_id), name 
  FROM employee_updates WHERE datebucket='20160309' LIMIT 3;

 datebucket | record_id                            | system.dateof(record_id) | name
------------+--------------------------------------+--------------------------+------------------
   20160309 | 511f9150-e5db-11e5-a4ba-a52893cc9f36 | 2016-03-09 09:43:02+0000 |   Helcine Kerman
   20160309 | 2f9f3670-e5db-11e5-a4ba-a52893cc9f36 | 2016-03-09 09:42:06+0000 |  Isabella Kerman
   20160309 | 23b0dc60-e5db-11e5-a4ba-a52893cc9f36 | 2016-03-09 09:41:46+0000 | Valentina Kerman

(3 rows)

As I have clustered this table on record_id in DESCending order, I can get the next 3 records simply by querying for a record_id less than the last one I read. In this case, that'd be 23b0dc60-e5db-11e5-a4ba-a52893cc9f36:

> SELECT datebucket, record_id, dateof(record_id), name 
  FROM employee_updates WHERE datebucket='20160309' 
  AND record_id < 23b0dc60-e5db-11e5-a4ba-a52893cc9f36 LIMIT 3;

 datebucket | record_id                            | system.dateof(record_id) | name
------------+--------------------------------------+--------------------------+-----------------
   20160309 | 16400100-e5db-11e5-a4ba-a52893cc9f36 | 2016-03-09 09:41:23+0000 | Jebediah Kerman
   20160309 | 0b239cf0-e5db-11e5-a4ba-a52893cc9f36 | 2016-03-09 09:41:05+0000 |     Bill Kerman
   20160309 | 00d648b0-e5db-11e5-a4ba-a52893cc9f36 | 2016-03-09 09:40:47+0000 |      Bob Kerman

(3 rows)
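The whole paging pattern can be sketched driver-side. This is a hedged Python simulation: uuid.uuid1() values stand in for Cassandra's timeuuids, and an in-memory list stands in for one datebucket partition; with a real driver you would issue the same SELECT with a `record_id < ?` bound instead of filtering a list.

```python
import uuid

# One 'datebucket' partition: rows clustered on a version-1 UUID,
# newest first (mirrors CLUSTERING ORDER BY (record_id DESC)).
names = ["Bob", "Bill", "Jebediah", "Valentina", "Isabella", "Helcine"]
rows = [(uuid.uuid1(), n) for n in names]           # inserted oldest-first
partition = sorted(rows, key=lambda r: r[0].time, reverse=True)

def fetch_page(partition, limit, before=None):
    """SELECT ... [AND record_id < before] LIMIT limit, in miniature."""
    page = [r for r in partition
            if before is None or r[0].time < before.time]
    return page[:limit]

page1 = fetch_page(partition, 3)
# To page further, use the last record_id we read as the upper bound:
page2 = fetch_page(partition, 3, before=page1[-1][0])

print([n for _, n in page1])   # the 3 most recent inserts
print([n for _, n in page2])   # the next 3
```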