Job Executor Stops Working - Catastrophic Failure

We are running Camunda 7.6.2-ee on WildFly 10.1. Recently, our Camunda instances have “locked up” and will not execute any further jobs.

We have cancelled all running processes, but the engine will not add a new job until we restart WildFly. Other observations:

  • WildFly itself seems to be functioning properly
  • You cannot cancel processes through the Cockpit GUI
  • The REST interface appears to be working, as you can get lists and DELETE processes through it
  • You can start a process, but the moment it hits an asynchronous boundary, it stops executing
  • System and database resources are more than adequate
  • A thread dump shows threads related to the Apache http-client in a WAITING state, and the frame org.apache.http.pool.PoolEntryFuture.get(PoolEntryFuture.java:102) is consistently present
  • After cancelling all processes through the REST API, the expected database tables are empty. If we truncate the ACT_RU_JOBDEF table, we still see no further job executions

In effect, Camunda is all but dead and we cannot figure out what is causing this. We do not yet know what has changed (for example, new or updated processes). What I’m looking for is help on how this could possibly occur. If all the processes are cancelled, how can Camunda not start new ones?

Thanks.

Michael

Hi Michael

This thread [1] may be of interest. In summary, it seems that if the connectors get blocked (in that case, an HTTP connector stuck on a long request with no response), the job executor can hang or stop acquiring jobs.

So, as for what might have changed: if you are calling remote services, verify connectivity and behaviour from your engine node to those remote services.
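
As a rough sketch of that check, something like the following could be run from the engine node. It is plain JDK code, not anything from Camunda or the connector library. The class name RemoteServiceProbe, the default URL and the timeout values are all made up for illustration, and it assumes the remote service answers a plain HTTP GET. The point is the explicit connect and read timeouts: a service that accepts the connection but never responds shows up as a timeout instead of a thread that waits forever.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper, not part of Camunda: probes one remote HTTP endpoint
// with explicit timeouts so a hanging service is reported instead of blocking.
public class RemoteServiceProbe {

    /** Outcome of one probe: HTTP status (or -1 on failure), elapsed time, error text. */
    public static final class Result {
        public final int status;
        public final long elapsedMillis;
        public final String error;

        Result(int status, long elapsedMillis, String error) {
            this.status = status;
            this.elapsedMillis = elapsedMillis;
            this.error = error;
        }

        @Override
        public String toString() {
            return "status=" + status + ", elapsedMillis=" + elapsedMillis
                    + (error == null ? "" : ", error=" + error);
        }
    }

    public static Result probe(String url, int connectTimeoutMillis, int readTimeoutMillis) {
        long start = System.nanoTime();
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(connectTimeoutMillis); // limit on establishing the connection
            conn.setReadTimeout(readTimeoutMillis);       // limit on waiting for response data
            conn.setRequestMethod("GET");
            int status = conn.getResponseCode();          // sends the request, waits for the status line
            return new Result(status, elapsedMillis(start), null);
        } catch (IOException e) {
            // Covers connection refused, unknown host, and both kinds of timeout.
            return new Result(-1, elapsedMillis(start),
                    e.getClass().getSimpleName() + ": " + e.getMessage());
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    private static long elapsedMillis(long startNanos) {
        return (System.nanoTime() - startNanos) / 1_000_000;
    }

    public static void main(String[] args) {
        // Placeholder URL; pass the real service endpoint as the first argument.
        String url = args.length > 0 ? args[0] : "http://localhost:8080/";
        System.out.println(url + " -> " + probe(url, 2000, 5000));
    }
}
```

If a probe like this reports a read timeout against one of the services your processes call, that would be consistent with the WAITING http-client threads in the thread dump, though it does not prove that is the cause.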

regards

Rob

[1] Job Executor hangs or stops acquiring jobs? (solved: HTTP-Connector stuck in endless job due to long http request/no response)