Data Modelling in Cassandra for job queues

Question

I am trying to store all the scheduler jobs in Cassandra.

I have designed all the locking tables and they seem fine, but I am having difficulty creating a job queue table.

My requirement is:

1) I need to query all jobs that are not completed.

CREATE TABLE jobs(
   jobId text,
   startTime timestamp,
   endTime timestamp,
   status text,
   state text,
   jobDetails text,
   primary key (X,X)) 
    with clustering order by (X desc);

where, state - on / off
status - running / failed / completed

I am not sure which column to use as the primary key, since it has to be unique. I also need to query all the jobs in the 'on' state. Could somebody help me design this in Cassandra? If you propose something with a composite partition key, I am fine with that.

Edit:

I came up with a data model like this:

CREATE TABLE job(
   jobId text,
   startTime timestamp,
   endTime timestamp,
   state text,
   status text,
   jobDetails text,
   primary key (state, jobId, startTime))
    with clustering order by (startTime desc);

I am able to insert like this:

INSERT INTO job (jobId, startTime, endTime, status,state, jobDetails) VALUES('nodestat',toTimestamp(now()), 0,'running','on','{
        "jobID": "job_0002",
        "jobName": "Job 2",
        "description": "This does job 2",
        "taskHandler": require("./jobs/job2").runTask,
        "intervalInMs": 1000
    }');

And query like this:

SELECT * FROM job WHERE state = 'on';

Will this have any performance impact?

Queues are anti-pattern in Cassandra: datastax.com/dev/blog/… — Alex Ott
My edited model is wrong, It won't support changing the state :( Any idea? — Harry
Your query will work only if you create a secondary index on table, but it will create huge partitions as mentioned in 2nd answer. But you may model your data around state itself - put finished tasks into separate table, for example? — Alex Ott
Can you help me in this please : stackoverflow.com/questions/48145888/lwt-in-cassandra — Harry

Mandraenke · Accepted Answer · 2018-01-05T07:58:18

You may be implementing an anti-pattern for Cassandra.

See https://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets for a blog post discussing the problems you might run into when using Cassandra as a message queue.

Apart from that, there is some information on how to do it the "right way" in Cassandra on SlideShare: https://de.slideshare.net/alimenkou/high-performance-queues-with-cassandra

There are many projects that are a better fit for scheduling and/or messaging, for example http://www.quartz-scheduler.org/overview/features.html.

Update for your edit above:

primary key (state,jobId, startTime)

This will create one partition for each state, resulting in huge partitions and hotspots. Transitioning a job's state will move it to a different partition: the old row has to be deleted, which leaves tombstones behind and can cause compaction and performance issues (depending on your number of jobs).
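
To make the state transition concrete, here is a rough sketch of what switching a job from 'on' to 'off' could look like against the job table from the question. The startTime and endTime values are made up for illustration. Because state is the partition key, it cannot be changed with an UPDATE, so the row has to be deleted from one partition and written again into the other:

```cql
-- Hypothetical values: the timestamps below are invented for this example.
-- state is part of the primary key, so it cannot be modified in place.
-- The DELETE leaves a tombstone in the 'on' partition.
BEGIN BATCH
  DELETE FROM job
   WHERE state = 'on'
     AND jobId = 'nodestat'
     AND startTime = '2018-01-05 07:00:00+0000';
  INSERT INTO job (state, jobId, startTime, endTime, status, jobDetails)
  VALUES ('off', 'nodestat', '2018-01-05 07:00:00+0000',
          '2018-01-05 07:05:00+0000', 'completed', '{}');
APPLY BATCH;
```

The batch spans two partitions, so it is not cheap either. With many jobs changing state, the tombstones accumulate in the 'on' partition and slow down reads of it.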

All jobs with state='on' will be on one node (and its replicas), and all jobs with state='off' on another. Your design will have only two partitions.
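
One common way to avoid having only two partitions is a composite partition key with a bucket column. The following is only a sketch under assumptions, not a recommendation from the blog post or slides above: the table name job_by_bucket, the bucket column, and the choice of four buckets are all hypothetical, and the client would have to compute the bucket itself (for example a hash of jobId modulo 4).

```cql
-- Hypothetical table: job_by_bucket and the bucket column are assumptions.
-- (state, bucket) is the composite partition key, so each state is spread
-- over several partitions instead of one.
CREATE TABLE job_by_bucket (
   state text,
   bucket int,
   jobId text,
   startTime timestamp,
   endTime timestamp,
   status text,
   jobDetails text,
   PRIMARY KEY ((state, bucket), jobId, startTime)
) WITH CLUSTERING ORDER BY (jobId ASC, startTime DESC);

-- Reading all jobs in the 'on' state now has to cover every bucket.
SELECT * FROM job_by_bucket WHERE state = 'on' AND bucket IN (0, 1, 2, 3);
```

This spreads the data over more nodes, but state is still part of the primary key, so changing a job's state remains a delete plus an insert and the queue anti-pattern concerns still apply.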
