Job Executor hangs or stops acquiring jobs? (solved: HTTP-Connector stuck in endless job due to long http request/no response)

Webcyberrob · August 22, 2017, 6:27am

Hi Stephen,

Ive tried unsuccessfully to trap the job executor SQL query when this condition occurs - perhaps you may have better luck. Ive not been able to consistently produce the behaviour, particularly when Im logging the DB queries…

regards

Rob

thorben · August 22, 2017, 7:23am

Hi Stephen,

You can get more useful log output when you set the following loggers to DEBUG:

org.camunda.bpm.engine.jobexecutor
org.camunda.bpm.engine.impl.persistence.entity.JobEntity

This will log things like when the job acquisition queries for jobs, etc.

edit: also, which Camunda version do you use?

Cheers,
Thorben

StephenOTT · August 22, 2017, 1:56pm

@thorben this is occurring on 7.6.

How do you set those specific loggers to DEBUG on shared engine?

thorben · August 22, 2017, 2:17pm

Which application server do you use?

StephenOTT · August 22, 2017, 2:20pm

Running on Tomcat.

(Updated the first post with same information for future ref)

thorben · August 22, 2017, 3:15pm

Add these lines to ${CATALINA_HOME}/conf/logging.properties:

org.camunda.bpm.engine.jobexecutor.level=ALL
org.camunda.bpm.engine.impl.persistence.entity.JobEntity.level=ALL

StephenOTT · August 22, 2017, 3:29pm

In the mean time, I have grabed some snippets of the logs:

gist.github.com

https://gist.github.com/StephenOTT/bb03d30880cef0e0cdc1b6a1410c1bb6

camunda_log

17-Aug-2017 10:39:11.382 INFO [pool-3-thread-3] org.camunda.commons.logging.BaseLogger.logInfo ENGINE-14008 Adding new exclusive job to job executor context. Job Id='d555d6b5-8359-11e7-a7f5-0242ac110016'
17-Aug-2017 10:39:11.617 INFO [pool-3-thread-3] org.camunda.commons.logging.BaseLogger.logInfo ENGINE-14008 Adding new exclusive job to job executor context. Job Id='d58683b0-8359-11e7-a7f5-0242ac110016'
17-Aug-2017 10:39:11.689 INFO [pool-3-thread-3] org.camunda.commons.logging.BaseLogger.logInfo ENGINE-14008 Adding new exclusive job to job executor context. Job Id='d591a747-8359-11e7-a7f5-0242ac110016'
17-Aug-2017 10:39:11.689 INFO [pool-3-thread-3] org.camunda.commons.logging.BaseLogger.logInfo ENGINE-14008 Adding new exclusive job to job executor context. Job Id='d5921c79-8359-11e7-a7f5-0242ac110016'
17-Aug-2017 10:39:12.151 INFO [pool-3-thread-3] org.camunda.commons.logging.BaseLogger.logInfo ENGINE-14008 Adding new exclusive job to job executor context. Job Id='d5d05e0e-8359-11e7-a7f5-0242ac110016'
17-Aug-2017 10:39:12.380 INFO [pool-3-thread-3] org.camunda.commons.logging.BaseLogger.logInfo ENGINE-14008 Adding new exclusive job to job executor context. Job Id='d5f5c071-8359-11e7-a7f5-0242ac110016'
17-Aug-2017 10:39:12.531 INFO [pool-3-thread-3] org.camunda.commons.logging.BaseLogger.logInfo ENGINE-14008 Adding new exclusive job to job executor context. Job Id='d612221a-8359-11e7-a7f5-0242ac110016'
17-Aug-2017 10:39:12.532 INFO [pool-3-thread-3] org.camunda.commons.logging.BaseLogger.logInfo ENGINE-14008 Adding new exclusive job to job executor context. Job Id='d612974c-8359-11e7-a7f5-0242ac110016'
17-Aug-2017 10:39:12.945 INFO [pool-3-thread-3] org.camunda.commons.logging.BaseLogger.logInfo ENGINE-14008 Adding new exclusive job to job executor context. Job Id='d6501595-8359-11e7-a7f5-0242ac110016'
17-Aug-2017 10:39:53.373 INFO [pool-3-thread-3] org.camunda.commons.logging.BaseLogger.logInfo ENGINE-14008 Adding new exclusive job to job executor context. Job Id='ee5a1b81-8359-11e7-a7f5-0242ac110016'

This file has been truncated. show original

After the last time, the logs stopped and no jobs were being added or executed.

After restarting the container, it started camunda back up, added jobs, but did not execute any.

thorben · August 22, 2017, 3:52pm

I’m afraid the log is not so useful without the SQL statements being logged. The latter would tell us exactly when and how the engine queries for jobs.

StephenOTT · August 22, 2017, 5:53pm

@thorben

Here are the logs from restarting the container after making the changes to the logging.porperties file as mentioned in: Job Executor hangs or stops acquiring jobs? (solved: HTTP-Connector stuck in endless job due to long http request/no response) - #11 by thorben

Anywhere you see things like “--------------------”, this is removal of identifying info.

gist.github.com

https://gist.github.com/StephenOTT/2317584106f8ba8e9271aaf6406b915e

camunda_logs_executor-full

22-Aug-2017 13:04:38.416 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Server version:        Apache Tomcat/8.0.24
22-Aug-2017 13:04:38.418 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Server built:          Jul 1 2015 20:19:55 UTC
22-Aug-2017 13:04:38.421 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Server number:         8.0.24.0
22-Aug-2017 13:04:38.421 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log OS Name:               Linux
22-Aug-2017 13:04:38.421 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log OS Version:            4.1.12-94.5.7.el7uek.x86_64
22-Aug-2017 13:04:38.421 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Architecture:          amd64
22-Aug-2017 13:04:38.421 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Java Home:             /usr/lib/jvm/java-8-oracle/jre
22-Aug-2017 13:04:38.421 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log JVM Version:           1.8.0_121-b13
22-Aug-2017 13:04:38.421 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log JVM Vendor:            Oracle Corporation
22-Aug-2017 13:04:38.421 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log CATALINA_BASE:         /camunda

This file has been truncated. show original

As the logs progress, you see that it looks for jobs but does not find any.

But you can see the jobs are waiting:

Few things i noticed that:

Calling up the active job through the REST for the process instance shows a normal looking process (a payload you would expect).
If i try and manually execute the job from the Start Event, it will execute the transaction and then wait at the task (in this case, the service task which is running HTTP-connector). When i manually execute the second job (the service task), I never receive a response from the server. I am using Postman, and it just runs “forever”, never seems to timeout or provide a server error.

There is a lot of HTTP-Connector usage in the process definition. Is there a possible connecting with http-connector using up resources? I have my doubts about this given that the logs show the executor looking for jobs but finding none (even for jobs that are waiting as a start event).

Edit:

Further Context:
Just noticed that if you manually execute a job that is stuck as the Start Event, the execution of the job does not show up in the logs. Is this normal? @thorben
If i try to execute the job a second time (after the job has already been completed), i get a InvalidRequestException and the error shows up in the logs.

StephenOTT · August 22, 2017, 9:15pm

@thorben is there a way to see which jobs are currently being processed by the executor? (through the api? or?)

thorben · August 23, 2017, 3:28pm

For the loggers that we activated, this is normal. Another useful logger is org.camunda.bpm.engine.cmd which logs whenever the engine begins and finishes commands. This logger should write something whenever you start executing a job.

This (and the exceptions in you log file) sounds very much like the HTTP requests take a long time and therefore consume all the job executor’s threads. In this case, the job acquisition thread slows down and acquires less jobs or no jobs at all until jobs can be executed again.

Could you check if your HTTP endpoint is generally slow? You can also configure connect to use a timeout when making requests, so you the jobs should fail in this case and free the execution resources. See HTTP Connector | docs.camunda.org and http://www.baeldung.com/httpclient-timeout (section 4) for an example.

Cheers,
Thorben

StephenOTT · August 23, 2017, 3:42pm

Does http-connector have a default timeout set?

thorben · August 23, 2017, 4:28pm

No. See https://hc.apache.org/httpcomponents-client-ga/httpclient/apidocs/org/apache/http/client/config/RequestConfig.html for the individual request configuration options and their defaults.

StephenOTT · August 23, 2017, 4:35pm

So ran some further tests and doing curl or wget directly from shell of camunda server to the HTTP endpoints that http-connector are connecting to provided consistently fast responses.

StephenOTT · August 23, 2017, 4:36pm

@thorben which exception in the log file are you referring to?

StephenOTT · August 23, 2017, 4:45pm

I also deleted all tokens that were sitting on http-connector service tasks but the other jobs did not get picked up by the executor.

StephenOTT · August 23, 2017, 5:05pm

Okay…

So get it working, but unclear what was causing the problem.

I manually deleted all process instances that were running across all definitions except for a few instances that were waiting at the blank start event.

The executor still did not pick up the jobs.

So i restarted the container and upon restart camunda picked-up the jobs and executed them without issues.

Running curl or wget from the camunda container to the endpoints that http-connector was connecting to, results in zero issues; everything connects quickly and as expected.

Most of the tokens that were waiting were in Receive Task and Event Gateways. But this may not be relevant given that the executor did not resolve the problem until server restart. Meaning that the http-connector service tasks could have been causing the problem but there was no indicator of resolution until server restart.

@thorben if jobs are in flight, and the process instance is paused/suspended, does the active job get cancelled or would it remain in queue until completion?

StephenOTT · August 23, 2017, 7:28pm

@thorben Additional question:

When you delete an active process instance that has a job that is currently executing; Does the Job get cancelled mid execution?

Edit: Did some tests. Looks like if a Job is being executed and the process instance is deleted the job will not be cancelled. I tested this by creating a load of jobs and the HTTP-connector endless response issue occurred (i think). And i deleted some potentially problematic process instances, but the job executor did not pick up the changes. Once i restarted the container, the job executor went back to normal

Edit: Have narrowed the problem down to be related to http-connector requests. Further testing tomorrow related to outside network issues (likely).

@thorben can the httpclient configuration settings be accessed within the connector variable of http-connector? For example using a listener or a script in one of the inputs to modify the timeouts? I looked through the docs, but I could not clearly see where the access could be.

thorben · August 24, 2017, 8:02am

Hey Stephen,

It runs until completion and then fails with an OptimisticLockingException since the job and execution tree was updated in the meantime.

No, same optimistic locking behavior as above.

I don’t think so, but haven’t really looked into this. The config I linked to above is used when the client is built. This only happens once, so you can’t configure the timeout on per-request basis this way. But maybe there’s another way with Apache Http client to do this.

StephenOTT · August 24, 2017, 3:52pm

@thorben okay great. We have narrowed it down then to specific use case network connectivity (only re-creatable under certain load conditions cause by camunda.

Any ideas on where to look for potentially modifying the per-request timeouts?