Job Executor hangs or stops acquiring jobs? (solved: HTTP-Connector stuck in endless job due to long http request/no response)

Trying to isolate a problem in an engine instance:

  1. A short-running process (a few seconds) that has a Message Start Event.
  2. A short-running process (a few minutes at most).

There were ~300 messages delivered in a short period of time to the Message Start Event of process #1. Process #1 sends messages to process #2.

Process #1 processed about half of the messages and then started to stack up process instances on the Message Start Event, and the executor appeared to be stuck/stalled/hanging.

Both process 1 and 2 have a large number of async tasks and multiple parallel branches.

When I restarted the engine, all of the waiting instances at the Message Start Event in process 1 and all waiting tokens in process 2 were processed/executed nearly instantly.

There was no error that I could see in the logs related to messages. If there is a keyword to search for, I can look back in the logs.

Anyone have ideas on what would cause the executor to hang/stall/get stuck?

Thanks!

Edit:
Note that if I look up the job and manually execute it through the REST API, it executes successfully, but then the instance is stuck at the next wait state.

The stuck job has:
retries: 3
exceptionMessage: null
dueDate: null
suspended: false
priority: 0

Edit 2:

Additional Server Details:

Running on Tomcat. Camunda 7.6. Shared Engine configuration.
MS SQL SERVER Database.

@camunda @thorben any ideas on what would cause this?

Some additional thoughts and notes:

  1. There are a few hundred process instances waiting at an event gateway.
  2. The DB is running on MS SQL
  3. No errors or incidents are being thrown
  4. Restarting the Camunda server does not always resolve the issue.

Hi Stephen,

I've experienced similar behaviour on rare occasions in a dev environment and once in production on a very old version. I've never tracked down the cause…

A few questions: how are these processes deployed, as part of a process application or via the REST API? (I ask as I'm still testing the behaviour of deployment-aware job execution and deployment via the REST API…)

Rob

Deployed via the REST API.

Hi Stephen,

I've tried unsuccessfully to trap the job executor SQL query when this condition occurs. Perhaps you may have better luck. I've not been able to reproduce the behaviour consistently, particularly when I'm logging the DB queries…

regards

Rob

Hi Stephen,

You can get more useful log output when you set the following loggers to DEBUG:

  • org.camunda.bpm.engine.jobexecutor
  • org.camunda.bpm.engine.impl.persistence.entity.JobEntity

This will log things like when the job acquisition queries for jobs, etc.

edit: also, which Camunda version do you use?

Cheers,
Thorben

@thorben this is occurring on 7.6.

How do you set those specific loggers to DEBUG on a shared engine?

Which application server do you use?

Running on Tomcat.

(Updated the first post with same information for future ref)

Add these lines to ${CATALINA_HOME}/conf/logging.properties:

org.camunda.bpm.engine.jobexecutor.level=ALL
org.camunda.bpm.engine.impl.persistence.entity.JobEntity.level=ALL
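For anyone unsure how these `java.util.logging` levels interact, here is a minimal, hedged sketch using only the JDK. It is not Camunda or Tomcat code: the class `JulLevelSketch`, the capturing handler, and the placeholder messages are made up for illustration. It shows that a record must pass both the logger's level and the handler's level, so a handler set to a stricter level can still hide the output even when the logger is set to ALL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

final class JulLevelSketch {

    /** Hypothetical handler that collects every record it accepts. */
    static final class CapturingHandler extends Handler {
        final List<String> messages = new ArrayList<>();

        @Override
        public void publish(LogRecord record) {
            // isLoggable applies the handler's own level (and filter, if any).
            if (isLoggable(record)) {
                messages.add(record.getLevel() + " " + record.getMessage());
            }
        }

        @Override
        public void flush() { }

        @Override
        public void close() { }
    }

    /** Logs one FINE and one INFO placeholder record and returns what the handler saw. */
    static List<String> capture(String loggerName, Level loggerLevel, Level handlerLevel) {
        Logger logger = Logger.getLogger(loggerName);
        CapturingHandler handler = new CapturingHandler();
        handler.setLevel(handlerLevel);
        logger.setUseParentHandlers(false); // keep the sketch's output out of the console
        logger.addHandler(handler);
        logger.setLevel(loggerLevel);
        try {
            logger.fine("debug-level placeholder");
            logger.info("info-level placeholder");
            return handler.messages;
        } finally {
            logger.removeHandler(handler);
        }
    }

    public static void main(String[] args) {
        String name = "org.camunda.bpm.engine.jobexecutor";
        System.out.println(capture(name, Level.ALL, Level.ALL));  // both records
        System.out.println(capture(name, Level.ALL, Level.INFO)); // handler drops FINE
        System.out.println(capture(name, Level.INFO, Level.ALL)); // logger drops FINE
    }
}
```

If the extra output does not show up after editing logging.properties, the handler levels in that file are worth checking as well.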

In the meantime, I have grabbed some snippets of the logs:

After the last time, the logs stopped and no jobs were being added or executed.

After I restarted the container, Camunda started back up and added jobs, but did not execute any.

I’m afraid the log is not so useful without the SQL statements being logged. The latter would tell us exactly when and how the engine queries for jobs.

@thorben

Here are the logs from restarting the container after making the changes to the logging.properties file as mentioned in: Job Executor hangs or stops acquiring jobs? (solved: HTTP-Connector stuck in endless job due to long http request/no response) - #11 by thorben

Anywhere you see things like “--------------------”, this is removal of identifying info.

As the logs progress, you see that it looks for jobs but does not find any.

But you can see the jobs are waiting:

A few things I noticed:

  1. Calling up the active job for the process instance through the REST API shows a normal-looking job (the payload you would expect).

  2. If I try to manually execute the job from the Start Event, it executes the transaction and then waits at the next task (in this case, the service task that runs HTTP-Connector). When I manually execute the second job (the service task), I never receive a response from the server. I am using Postman, and it just runs “forever”; it never seems to time out or return a server error.

There is a lot of HTTP-Connector usage in the process definition. Is there a possible connection with HTTP-Connector using up resources? I have my doubts about this, given that the logs show the executor looking for jobs but finding none (even for jobs that are waiting at a start event).

Edit:

Further Context:
Just noticed that if you manually execute a job that is stuck at the Start Event, the execution of the job does not show up in the logs. Is this normal? @thorben
If I try to execute the job a second time (after the job has already been completed), I get an InvalidRequestException and the error shows up in the logs.

@thorben is there a way to see which jobs are currently being processed by the executor? (through the api? or?)

For the loggers that we activated, this is normal. Another useful logger is org.camunda.bpm.engine.cmd which logs whenever the engine begins and finishes commands. This logger should write something whenever you start executing a job.

This (and the exceptions in your log file) sounds very much like the HTTP requests take a long time and therefore consume all of the job executor’s threads. In this case, the job acquisition thread slows down and acquires fewer jobs, or no jobs at all, until jobs can be executed again.
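The effect described above can be modelled with a plain JDK thread pool. This is only a rough sketch under stated assumptions, not Camunda's job executor: the class `ExecutorStarvationSketch`, the pool size of 3, and the queue size of 3 are hypothetical values chosen for illustration. Every submitted task blocks, standing in for an HTTP call that never returns, and once all workers are busy and the queue is full, further submissions are rejected.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class ExecutorStarvationSketch {

    /**
     * Submits `jobs` blocking tasks to a pool with `threads` workers and a
     * bounded queue of `queueSize` (must be at least 1), and returns how many
     * submissions were rejected because no worker or queue slot was free.
     */
    static int countRejected(int threads, int queueSize, int jobs) {
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize));
        int rejected = 0;
        try {
            for (int i = 0; i < jobs; i++) {
                try {
                    // Each task waits until released, like a request with no response.
                    pool.execute(() -> {
                        try {
                            release.await();
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++; // all workers blocked and the queue is full
                }
            }
        } finally {
            release.countDown(); // let the blocked tasks finish so the pool can stop
            pool.shutdown();
            try {
                pool.awaitTermination(5, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        // 3 tasks run, 3 wait in the queue, the remaining 14 are rejected.
        System.out.println("rejected: " + countRejected(3, 3, 20)); // prints "rejected: 14"
    }
}
```

In this model nothing fails and nothing is logged as an error. The pool simply stops accepting work until a blocked task returns, which matches the symptom of jobs piling up without incidents.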

Could you check if your HTTP endpoint is generally slow? You can also configure Camunda Connect to use a timeout when making requests, so that the jobs fail in this case and free the execution resources. See HTTP Connector | docs.camunda.org and http://www.baeldung.com/httpclient-timeout (section 4) for an example.

Cheers,
Thorben

Does http-connector have a default timeout set?

No. See https://hc.apache.org/httpcomponents-client-ga/httpclient/apidocs/org/apache/http/client/config/RequestConfig.html for the individual request configuration options and their defaults.
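To illustrate why a missing timeout matters, here is a hedged, JDK-only sketch. It deliberately uses `java.net.HttpURLConnection` rather than http-connector or Apache HttpClient, so it shows the general behaviour and not the connector's API. The class `ReadTimeoutSketch` and the local "silent" server are assumptions made for the example. The server accepts the TCP connection but never answers; with a read timeout set, the call fails with a `SocketTimeoutException` instead of blocking the calling thread indefinitely. In `HttpURLConnection`, a timeout of 0 means wait forever.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetAddress;
import java.net.Proxy;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.net.URI;

final class ReadTimeoutSketch {

    /**
     * Calls a local endpoint that accepts connections but never responds.
     * Returns true if the request was aborted by a socket timeout.
     */
    static boolean timesOut(int connectTimeoutMs, int readTimeoutMs) {
        // Port 0 lets the OS pick a free port; accept() is never called,
        // so the client connects but no response is ever written.
        try (ServerSocket silentServer =
                     new ServerSocket(0, 50, InetAddress.getByName("127.0.0.1"))) {
            URI uri = URI.create("http://127.0.0.1:" + silentServer.getLocalPort() + "/");
            HttpURLConnection connection =
                    (HttpURLConnection) uri.toURL().openConnection(Proxy.NO_PROXY);
            connection.setConnectTimeout(connectTimeoutMs); // limit for establishing the connection
            connection.setReadTimeout(readTimeoutMs);       // limit for waiting on response data
            try {
                connection.getResponseCode(); // sends the request and waits for a reply
                return false;
            } catch (SocketTimeoutException e) {
                return true; // the thread is freed instead of waiting forever
            } finally {
                connection.disconnect();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("timed out: " + timesOut(1000, 500));
    }
}
```

The analogous settings on the Apache HttpClient `RequestConfig` linked above are the connect, connection request, and socket timeouts. How they are wired into http-connector depends on the connector configuration and is not shown here.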

I ran some further tests: calling the HTTP endpoints that http-connector connects to with curl or wget, directly from a shell on the Camunda server, gave consistently fast responses.

@thorben which exception in the log file are you referring to?