
We are facing a major incident in our Camunda orchestrator. Once we reach 100 running process instances, Camunda Cockpit hangs and never responds. We have the same issue when calling /app/engine/. A few messages are consumed from RabbitMQ, and then everything stops.

The application, however, is not down. I suspect a process engine configuration issue, because I am unable to get the job executor log.

When I set JobExecutorActivate to false, everything works for Cockpit and queue consumption, but processes stop at the end of the first subprocess.

This log loops non-stop:

2018/11/17 14:47:33.258 DEBUG ENGINE-14012 Job acquisition thread woke up
2018/11/17 14:47:33.258 DEBUG ENGINE-14022 Acquired 0 jobs for process engine 'default': []
2018/11/17 14:47:33.258 DEBUG ENGINE-14023 Execute jobs for process engine 'default': [8338]
2018/11/17 14:47:33.258 DEBUG ENGINE-14023 Execute jobs for process engine 'default': [8217]
2018/11/17 14:47:33.258 DEBUG ENGINE-14023 Execute jobs for process engine 'default': [8256]
2018/11/17 14:47:33.258 DEBUG ENGINE-14011 Job acquisition thread sleeping for 100 millis
2018/11/17 14:47:33.359 DEBUG ENGINE-14012 Job acquisition thread woke up

And this log too (for queue consumption):

2018/11/17 15:04:19.582 DEBUG Waiting for message from consumer. {"null":null}
2018/11/17 15:04:19.582 DEBUG Retrieving delivery for Consumer@5d05f453: tags=[{amq.ctag-0ivcbc2QL7g-Duyu2Rcbow=queue_response}], channel=Cached Rabbit Channel: AMQChannel(amqp://[email protected]:5672/,4), conn: Proxy@77a5983d Shared Rabbit Connection: SimpleConnection@17a1dd78 [delegate=amqp://[email protected]:5672/, localPort= 49812], acknowledgeMode=AUTO local queue size=0 {"null":null}

Environment : Spring Boot 2.0.3.RELEASE, Camunda v7.9.0 with PostgreSQL, RabbitMQ

Camunda BPM listens to and publishes to 165 RabbitMQ queues.

Configuration :

# Data source (PostgreSql)
com.campDo.fr.camunda.datasource.url=jdbc:postgresql://localhost:5432/campDo
com.campDo.fr.camunda.datasource.username=campDo
com.campDo.fr.camunda.datasource.password=password
com.campDo.fr.camunda.datasource.driver-class-name=org.postgresql.Driver
com.campDo.fr.camunda.bpm.database.jdbc-batch-processing=false
oms.camunda.retry.timer=1
oms.camunda.retry.nb-max=2

SpringProcessEngineConfiguration :

@Bean
public SpringProcessEngineConfiguration processEngineConfiguration() throws IOException {
    final SpringProcessEngineConfiguration config = new SpringProcessEngineConfiguration();
    config.setDataSource(camundaDataSource);
    config.setDatabaseSchemaUpdate("true");
    config.setTransactionManager(transactionManager());
    config.setHistory("audit");
    config.setJobExecutorActivate(true);
    config.setMetricsEnabled(false);
    final Resource[] resources = resourceLoader.getResources(CLASSPATH_ALL_URL_PREFIX + "/processes/*.bpmn");
    config.setDeploymentResources(resources);

    return config;
}

Pom dependencies :

<dependency>
    <groupId>org.camunda.bpm.springboot</groupId>
    <artifactId>camunda-bpm-spring-boot-starter</artifactId>
</dependency>
<dependency>
    <groupId>org.camunda.bpm.springboot</groupId>
    <artifactId>camunda-bpm-spring-boot-starter-webapp</artifactId>
</dependency>
<dependency>
    <groupId>org.camunda.bpm.springboot</groupId>
    <artifactId>camunda-bpm-spring-boot-starter-rest</artifactId>
</dependency>

I am quite sure that my job executor config is wrong.

Update :

I can start Cockpit and make Camunda consume messages by setting JobExecutorActivate to false, but processes still stop at the first step that requires the job executor:

config.setJobExecutorActivate(false);

Thanks for your help.

Answer:


First: if your process contains asynchronous steps (jobs), it will pause at each of them. Activating the job executor only tells Camunda that it should pick up and work on these jobs. If you disable the executor, your processes still stop at those steps, and since no one executes the jobs, they stay stopped. Disabling job execution only makes sense during testing, or when you run multiple nodes and only some of them should do the processing.
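A toy model of that first point may help. This is plain Java, not Camunda's implementation: reaching an async step only stores a job, and the process goes no further until something runs that job. The class, methods and step name below are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class ToyJobModel {
    // Toy stand-in for the engine's job storage: async steps are parked here
    // instead of being run in the caller's thread.
    private final Queue<String> jobs = new ArrayDeque<>();
    private final List<String> completed = new ArrayList<>();

    void reachAsyncStep(String step) {
        jobs.add(step); // the process instance "pauses" here
    }

    // Drastically simplified job executor: take every pending job and run it.
    void runJobExecutorOnce() {
        while (!jobs.isEmpty()) {
            completed.add(jobs.poll());
        }
    }

    // Returns the steps that actually ran, with or without an executor.
    static List<String> simulate(boolean jobExecutorActive) {
        ToyJobModel engine = new ToyJobModel();
        engine.reachAsyncStep("after-first-subprocess"); // hypothetical step name
        if (jobExecutorActive) {
            engine.runJobExecutorOnce();
        }
        return engine.completed;
    }

    public static void main(String[] args) {
        System.out.println("executor on:  " + simulate(true));
        System.out.println("executor off: " + simulate(false));
    }
}
```

With the executor switched off the job is still stored, it simply never runs, which matches processes stopping after the first subprocess.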

To your main issue: the job executor works with a thread pool. From what you describe, it is very likely that all threads in the pool block forever, so they never finish and never return, which means your system is stuck.
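To see why a few blocked threads freeze everything, here is a sketch using a plain fixed pool from `java.util.concurrent` rather than Camunda's own executor. The pool size of 3, the 500 ms wait and the never-released latch (standing in for a call that hangs forever) are all assumptions for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PoolExhaustionDemo {
    /**
     * Submits blockingJobs tasks that never finish to a pool of poolSize threads,
     * then checks whether one more trivial job still gets a thread within 500 ms.
     */
    static boolean nextJobStarves(int poolSize, int blockingJobs) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        CountDownLatch neverReleased = new CountDownLatch(1);
        try {
            for (int i = 0; i < blockingJobs; i++) {
                // Stands in for delegate code that waits forever (e.g. no timeout)
                pool.submit(() -> {
                    neverReleased.await();
                    return null;
                });
            }
            Future<String> nextJob = pool.submit(() -> "done");
            try {
                nextJob.get(500, TimeUnit.MILLISECONDS);
                return false; // a free thread picked it up
            } catch (TimeoutException e) {
                return true; // every thread is stuck, so the job just sits in the queue
            }
        } finally {
            pool.shutdownNow(); // interrupts the blocked threads so the demo can exit
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("3 threads, 2 stuck jobs -> starved: " + nextJobStarves(3, 2));
        System.out.println("3 threads, 3 stuck jobs -> starved: " + nextJobStarves(3, 3));
    }
}
```

As long as one thread is free, work continues; once every thread is blocked, new jobs are accepted but never run, and nothing in the logs looks like a crash.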

This happened to us a while ago with an SMTP server: the connection had an infinite timeout, so the threads kept waiting even though the machine was unavailable.
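The SMTP case boils down to a socket read with no timeout; in `java.net.Socket`, a read timeout of 0 means "wait forever". The sketch below uses only `java.net`: a local `ServerSocket` that accepts the connection but never replies stands in for the unavailable server, and the 200 ms timeout is an arbitrary assumption. If you use JavaMail, its SMTP provider exposes the same knobs as the `mail.smtp.connectiontimeout` and `mail.smtp.timeout` properties.

```java
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class SocketTimeoutDemo {
    /** Returns true if the read gives up after timeoutMs instead of hanging. */
    static boolean readTimesOut(int timeoutMs) throws Exception {
        // Local server that accepts connections but never sends a byte,
        // standing in for an unresponsive remote host
        try (ServerSocket silentServer = new ServerSocket(0);
             Socket client = new Socket()) {
            // Bound the connect as well as the read
            client.connect(new InetSocketAddress("127.0.0.1", silentServer.getLocalPort()), timeoutMs);
            client.setSoTimeout(timeoutMs); // 0 here would block the thread forever
            InputStream in = client.getInputStream();
            try {
                in.read();
                return false; // got data or end of stream
            } catch (SocketTimeoutException e) {
                return true; // the thread is released instead of waiting forever
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

With a timeout set, a dead remote host costs the job a failed attempt (which Camunda can retry) instead of costing you a job executor thread for good.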

Since job execution in Camunda is highly reliable and well tested per se, I would suggest that you double-check everything you do in your delegates. If you are lucky (and I am right), you will find the spot where you just wait forever.
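A quick way to find that spot is a thread dump of the running application, e.g. with the JDK tools `jstack <pid>` or `jcmd <pid> Thread.print`; stuck job threads show the same blocking frame (a socket read, a lock, a latch) every time you dump. The sketch below does the same from inside a JVM with `Thread.getAllStackTraces()`. The thread name prefix `demo-job-worker` is made up; check the names your own executor threads actually use in a real dump.

```java
import java.util.Map;
import java.util.concurrent.CountDownLatch;

public class StuckThreadFinder {
    /** Prints the stack of every live thread whose name starts with namePrefix; returns how many matched. */
    static int dumpThreads(String namePrefix) {
        int found = 0;
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            if (!thread.getName().startsWith(namePrefix)) {
                continue;
            }
            found++;
            System.out.println(thread.getName() + " (" + thread.getState() + ")");
            for (StackTraceElement frame : entry.getValue()) {
                System.out.println("    at " + frame);
            }
        }
        return found;
    }

    /** Starts one worker that waits forever, dumps it, then releases it. */
    static int demo() throws InterruptedException {
        CountDownLatch neverReleased = new CountDownLatch(1);
        Thread worker = new Thread(() -> {
            try {
                neverReleased.await(); // the "wait forever" spot the dump should reveal
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "demo-job-worker-1");
        worker.start();
        Thread.sleep(200); // give the worker time to reach await()
        try {
            return dumpThreads("demo-job-worker");
        } finally {
            worker.interrupt();
            worker.join();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("matching threads: " + demo());
    }
}
```

The printed stack for the worker ends in `CountDownLatch.await`; in the real application you would look for the equivalent frame inside your delegate code.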