I have a MarkLogic 7 database into which several documents are inserted, and each document has its own created-on and released-on values. For example, if a document is inserted into the database at 1400 hrs and its released-on value is 1700 hrs, then I need to POST this document to an external REST service at 1700 hrs.

I have tried the following options:

  1. Configure a CPF pipeline such that whenever a document is inserted, its released-on value is read and a Scheduled Task is created to trigger at the timestamp read from released-on.

    Following are the queries/ observations for this approach:

    1. Since the admin configuration manipulation APIs are not transactionally protected operations, I need to force a lock on some URI in order to create Scheduled Tasks from within CPF action modules running in parallel. For details read here

    2. When I insert 1000 documents it takes around 20 minutes for the CPF action modules to trigger and create 1000 scheduled tasks based on the released-on value read from the inserted document.

    3. How can I pass the URI of the document that triggered the CPF action module to the Scheduled Task that the action module creates from the document's released-on value?

  2. Configure a CPF pipeline such that whenever a document is inserted, its released-on value is read and xdmp:sleep() is called with the number of milliseconds remaining between the current dateTime and the document's released-on value.

    Following are the queries/ observations for this approach:

    1. The Task Server threads on which the CPF action modules run remain occupied and are not released while xdmp:sleep() is in progress. As a result, the CPF action module runs for at most 16 documents at any time, and the rest wait in the queue.

    2. Is there any way to make the sleeping thread inactive so that other queued action modules can run, and then reactivate it once the sleep duration has elapsed?

  3. Configure a multi-step CPF pipeline as described here in which the document keeps bouncing between two states until its released-on timestamp arrives.

    Following are the queries/ observations for this approach:

    1. Even when only 30 documents were inserted, CPU utilization was observed to be 100%.

In all of these attempts, a lot of system resources (CPU and RAM) are consumed even for as few as 1000 documents. I need an approach that can scale to 100K documents.

Please let me know whether the approaches above can be improved, or whether MarkLogic provides some other way to handle such scenarios efficiently.

I think options 2 & 3 will only cause you problems. A better question is why does it take 20 minutes to create 1000 scheduled tasks? Are you locking with the same URI for all inserted documents? Set log level to debug, and check the error log for DEADLOCK messages. That should give you an idea of whether your performance problem is lock-related. - wst
@wst: Yes, I am locking with the same URI, using xdmp:lock-for-update("/sample.xml"). The complete CPF action module is posted here - Rahul
Inserting 1000 documents that all lock on the same URI sounds like your problem. Use a URI unique to the inserted document; then all the other tasks can run in parallel. Alternatively, look into Multi-Statement Transactions, which can serially execute expressions in separate transactions. - wst
@wst I tried locking on the URI of the document that triggered the CPF action module, using xdmp:lock-for-update($cpf:document-uri). In that case, when I insert multiple documents in parallel, the CPF action module is triggered for all of the documents but ONLY ONE Scheduled Task gets created. For more read here - Rahul
Ah, I see now. I still suspect the problem could be too many processes contending for that lock. You could go back to your single-URI lock, and have the CPF job xdmp:spawn the actual work to a task. Then you would have at most N (N = Task Server Threads) tasks contending for the lock, not all 1000. Reduce N to 2 or 1 and you might get an improvement. Worth a shot. - wst

1 Answer

Rather than CPF, you could set up a single scheduled job that runs, say, every 10 minutes and looks for documents that are ready to be published. That job would look for documents whose released-on values fall between the last time the job ran, which I would save in the database, and fn:current-dateTime().

For each of those documents, you would spawn a task to POST the document, so that an error in one doesn't cause problems for the others. After looping through, save the current time in the database for the next time.

The 10-minute window can be as large or small as you like, depending on your tolerance for a little delay.
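
A minimal sketch of the polling job described above, not a tested implementation. It assumes that released-on is an element with an xs:dateTime element range index, that the URI lexicon is enabled, and that this module is configured as a Scheduled Task running every 10 minutes. The state document URI, the task module path, and the REST endpoint are all hypothetical names chosen for illustration.

```xquery
xquery version "1.0-ml";

(: Scheduled module, e.g. /scheduled/publish-released.xqy (hypothetical path).
   Assumptions: dateTime element range index on released-on; URI lexicon enabled. :)

(: Hypothetical URI of the document that remembers the last run time. :)
declare variable $STATE-URI := "/publish-job/last-run.xml";
(: Hypothetical path of the module that POSTs a single document. :)
declare variable $TASK-MODULE := "/tasks/post-document.xqy";

let $now := fn:current-dateTime()
let $last-run := fn:doc($STATE-URI)/last-run
let $since :=
  if (fn:exists($last-run))
  then xs:dateTime($last-run)
  else xs:dateTime("1970-01-01T00:00:00Z")  (: first run: pick up everything due :)
(: URIs of documents whose released-on falls in the window ($since, $now] :)
let $due :=
  cts:uris((), (),
    cts:and-query((
      cts:element-range-query(xs:QName("released-on"), ">", $since),
      cts:element-range-query(xs:QName("released-on"), "<=", $now)
    )))
return (
  (: One spawned task per document, so a failed POST does not affect the others. :)
  for $uri in $due
  return xdmp:spawn($TASK-MODULE, (xs:QName("uri"), $uri)),
  (: Remember this run's upper bound as the next run's lower bound. :)
  xdmp:document-insert($STATE-URI, <last-run>{$now}</last-run>)
)
```

The spawned task module could then be as small as the following, again with a placeholder endpoint:

```xquery
xquery version "1.0-ml";

(: Task module, e.g. /tasks/post-document.xqy (hypothetical path). :)
declare variable $uri as xs:string external;

(: http://example.com/publish is a placeholder for the external REST service. :)
xdmp:http-post(
  "http://example.com/publish",
  <options xmlns="xdmp:http">
    <headers><content-type>application/xml</content-type></headers>
  </options>,
  fn:doc($uri))
```

Using an exclusive lower bound and an inclusive upper bound keeps consecutive windows from overlapping, so a document whose released-on equals a run's timestamp is picked up exactly once.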