0
votes

I am trying to store all the scheduler jobs in Cassandra.

I have designed all the locking tables and they seem fine, but I am having difficulty creating a job queue table.

My requirement is:

1) I need to query all the jobs that are not completed.

CREATE TABLE jobs(
   jobId text,
   startTime timestamp,
   endTime timestamp,
   status text,
   state text,
   jobDetails text,
   primary key (X,X)) 
    with clustering order by (X desc);

where, state - on / off
status - running / failed / completed

I am not sure which column to use as the primary key (since it must be unique). I also need to query all the jobs in the 'on' state. Could somebody help me design this in Cassandra? Even if you propose something with a composite partition key, I am fine with it.

Edit:

I came up with a data model like this:

CREATE TABLE job(
   jobId text,
   startTime timestamp,
   endTime timestamp,
   state text,
   status text,
   jobDetails text,
   primary key (state, jobId, startTime))
    with clustering order by (jobId asc, startTime desc);

I am able to insert like this,

INSERT INTO job (jobId, startTime, endTime, status,state, jobDetails) VALUES('nodestat',toTimestamp(now()), 0,'running','on','{
        "jobID": "job_0002",
        "jobName": "Job 2",
        "description": "This does job 2",
        "taskHandler": require("./jobs/job2").runTask,
        "intervalInMs": 1000
    }');

and query like this:

SELECT * FROM job WHERE state = 'on';

Will this have any performance impact?

3
Queues are an anti-pattern in Cassandra: datastax.com/dev/blog/… - Alex Ott
My edited model is wrong. It won't support changing the state :( Any ideas? - Harry
Your query will work only if you create a secondary index on the table, but it will create huge partitions, as mentioned in the 2nd answer. But you may model your data around state itself: put finished tasks into a separate table, for example? - Alex Ott
Can you help me with this, please: stackoverflow.com/questions/48145888/lwt-in-cassandra - Harry

3 Answers

1
votes

You may be implementing a Cassandra anti-pattern.

See https://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets for a blog post discussing the problems you might run into when using Cassandra as a message queue.

Apart from that, there is some information on how to do it the "right way" in Cassandra on Slideshare: https://de.slideshare.net/alimenkou/high-performance-queues-with-cassandra

There are many projects out there that are a better fit for scheduling and/or messaging, for example http://www.quartz-scheduler.org/overview/features.html.

Update for your edit above:

primary key (state,jobId, startTime) 

This will create one partition for each state, resulting in huge partitions and hotspots. Transitioning a job's state will move it to a different partition, so you will accumulate deleted entries (tombstones) and may run into compaction and performance issues, depending on your number of jobs.

All jobs with state='on' will be on one node (and its replicas), and all jobs with state='off' on another node. Your design will have only two partitions.
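
To make the cost of a state transition concrete, here is a minimal sketch against the job table from the question's edit. Because state is the partition key, it cannot be changed with an UPDATE; the row has to be deleted from one partition and re-inserted into the other. The startTime literal and the jobDetails value are made-up placeholders, not values from the question.

```cql
-- Sketch only: moves job 'nodestat' from the 'on' partition to the 'off' partition.
-- The timestamp and the '{}' payload are hypothetical placeholders.
BEGIN BATCH
  -- Leaves a tombstone behind in the 'on' partition.
  DELETE FROM job
   WHERE state = 'on'
     AND jobId = 'nodestat'
     AND startTime = '2018-01-05 18:00:00+0000';

  -- Re-creates the row in the 'off' partition.
  INSERT INTO job (state, jobId, startTime, status, jobDetails)
  VALUES ('off', 'nodestat', '2018-01-05 18:00:00+0000', 'completed', '{}');
APPLY BATCH;
```

Every transition therefore writes a tombstone into a partition that is read often, which is the queue-like access pattern the blog post above warns about.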

1
votes

Since you are open to changes to the model, see if the model below works for you:

   CREATE TABLE job(
   partition_key text,  -- type assumed; holds a time bucket such as '20180105'
   jobId text,
   startTime timestamp,
   endTime timestamp,
   state text,
   status text,
   jobDetails text,
   primary key (partition_key, state, jobId, startTime))
   with clustering order by (state asc, jobId asc, startTime desc);

Here the partition_key column value can be calculated based on your volume of jobs.

For example:

If your job count is less than 100K jobs for a single day, you can keep the partition at the single-day level, i.e. YYYYMMDD (20180105); if it is 100K per hour, you can change it to YYYYMMDDHH (2018010518). Change the clustering columns depending on your filtering order.

  • This way you can query by state only if you know the time period you want to query.
  • It avoids creating too many partitions or bloating a partition with too many columns.
  • It distributes load evenly across partitions.
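
As a rough illustration of how this table would be read, the queries below assume partition_key is stored as a text day bucket (the model above does not specify the type, so that is an assumption) and that state is the first clustering column.

```cql
-- Sketch only: all jobs in the 'on' state for one day bucket.
-- Assumes partition_key is text in YYYYMMDD form.
SELECT * FROM job
 WHERE partition_key = '20180105'
   AND state = 'on';

-- Several buckets can be read by listing them explicitly.
SELECT * FROM job
 WHERE partition_key IN ('20180104', '20180105')
   AND state = 'on';
```

The caller has to know which buckets to ask for, so a scheduler would need to track or enumerate the buckets that may still contain unfinished jobs.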

It would be easier to design a better model if you could specify what adjustments or additions you can make to your query.

-1
votes

You need to include the columns you filter on by equality in the partition key; in your case these are status and state. You need to check whether these two make a good partition key. If not, you need to add either a custom column or another existing column to the partition key. Since jobId is what makes a record unique, you can keep it as a clustering column. I am assuming you are not querying the table by jobId.
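
A minimal sketch of what this could look like follows. The table name jobs_by_state_status is hypothetical, and the column list is copied from the question.

```cql
-- Sketch only: state and status together form the partition key,
-- and jobId is a clustering column that keeps each row unique.
CREATE TABLE jobs_by_state_status (
   state text,
   status text,
   jobId text,
   startTime timestamp,
   endTime timestamp,
   jobDetails text,
   PRIMARY KEY ((state, status), jobId)
);

-- Both partition key columns must be given with equality.
SELECT * FROM jobs_by_state_status
 WHERE state = 'on' AND status = 'running';
```

With two states and three statuses this yields at most six partitions, so the check on whether the pair makes a good partition key matters; jobs that are "not completed" would also need one query per remaining status.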