Service Task with Async Continuation Never Executes


#21

Bad news - this problem is occurring again.

Back in August, as suggested by thorben, we deactivated the deploymentAware setting. The current setting in our bpm-platform.xml is: <property name="jobExecutorDeploymentAware">false</property>

For several months, we did not have any problems. However, just recently the same problem we originally reported started occurring again. Previously, if we deployed a new version of the model that was getting hung, subsequent instances would run fine. Now, however, deploying a new version does not clear up the problem.

So far, we cannot find a pattern to the problem. When we bring up a new server, things seem to run fine for a while, but then suddenly the workflow model in question will no longer execute the async service task. Once we get into this state, it seems the only way to resolve it is to tear down the server and bring up a new one.

To demonstrate the problem, I have created a modified version of the Camunda http-connector example. I have uploaded this model: invokeRestService.src.bpmn (11.7 KB)

Also, here is the ACT_RU_JOB table entry for one of the stuck instances:

ID_: 'cef61540-be55-11e6-ba32-0242ac120003'
REV_: '1'
TYPE_: 'message'
LOCK_EXP_TIME_: NULL
LOCK_OWNER_: NULL
EXCLUSIVE_: '0'
EXECUTION_ID_: 'cef5ee2e-be55-11e6-ba32-0242ac120003'
PROCESS_INSTANCE_ID_: 'cef5c717-be55-11e6-ba32-0242ac120003'
PROCESS_DEF_ID_: '559c8fbd-be55-11e6-b2b3-0242ac120003'
PROCESS_DEF_KEY_: '_c2ab1f62-ecd4-4401-96c2-0f67552b1b2a'
RETRIES_: '3'
EXCEPTION_STACK_ID_: NULL
EXCEPTION_MSG_: NULL
DUEDATE_: NULL
REPEAT_: NULL
HANDLER_TYPE_: 'async-continuation'
HANDLER_CFG_: 'transition-create-scope'
DEPLOYMENT_ID_: '558fe58b-be55-11e6-b2b3-0242ac120003'
SUSPENSION_STATE_: '1'
JOB_DEF_ID_: '559c8fbe-be55-11e6-b2b3-0242ac120003'
PRIORITY_: '0'
SEQUENCE_COUNTER_: '1'
TENANT_ID_: '7d48ec6a-2144-4535-b54c-2c23e703f3e1'

Here is a screen shot of the cockpit showing instances stuck on the service task:

One item I did not previously mention: we are deploying the Docker version of the Camunda engine, but with our own modifications to bpm-platform.xml. Otherwise, our environment is the same as reported previously in this thread.

Please let me know what additional information I can provide to help solve this issue. Any suggestions for other things we should investigate would be helpful.


#22

We believe we have found the cause of this most recent occurrence of this issue.

We discovered that a workflow had been deployed with a defect that seems to have caused the issue. This workflow contained an async script (JavaScript) task with an infinite loop. When an instance of this workflow was running, we saw a significant spike in CPU usage for the Camunda Java application. It seems that this script task was running constantly. We think this may have prevented any other async jobs from running. Once we deleted all running instances of this workflow, everything returned to normal: CPU usage dropped back and other workflows with async tasks ran as expected.

As of now, we are not experiencing the originally reported issue.


#23

Is there such a thing as a maximum execution time limit for scripts, tasks, etc. in Camunda?


#24

Note that we tried putting a BPMN timer event on the script task, such that it cancelled the task after a minute and then continued on to end the workflow. We observed that while the script task was running there was a significant increase in CPU utilization for the Java process (up to 200%). But even after the task was cancelled and the process instance terminated, the CPU utilization never returned to a reasonable value. I have no idea why that would be.


#25

Hi guys,

Jobs are executed by a thread pool (of limited size by default). Build an infinite loop and you block those threads. The process engine never interrupts/kills running threads. That is also not how timer events in Camunda work. Interrupting events never immediately interrupt other transactions (how should that even work in a single JVM or in a cluster?), but parallel transactions that are in conflict with each other are resolved via optimistic locking. That requires both engine commands to terminate.
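To make the thread pool point concrete, here is a minimal sketch that uses only the JDK. It is not Camunda's job executor: a plain fixed-size ExecutorService stands in for the job execution pool, a busy loop stands in for the script with the infinite loop, and the names PoolStarvationSketch and shortJobCompletes are made up for this illustration. The stop flag exists only so the demo can clean up after itself; a real runaway script has no such flag.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

public class PoolStarvationSketch {

    /**
     * Submits `stuckJobs` never-ending jobs to a pool of `poolSize` threads,
     * then submits one short job and reports whether it finished within
     * `waitMillis` milliseconds.
     */
    static boolean shortJobCompletes(int poolSize, int stuckJobs, long waitMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        // Demo-only escape hatch so the JVM can exit afterwards.
        AtomicBoolean stop = new AtomicBoolean(false);
        try {
            for (int i = 0; i < stuckJobs; i++) {
                pool.submit(() -> {
                    // Stand-in for a script task containing an infinite loop.
                    while (!stop.get()) {
                        // busy spin: this is what drives CPU usage up
                    }
                });
            }
            // Stand-in for an unrelated async continuation job.
            Future<String> shortJob = pool.submit(() -> "done");
            try {
                shortJob.get(waitMillis, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                // No free thread picked the job up in time.
                return false;
            } catch (InterruptedException | ExecutionException e) {
                throw new IllegalStateException(e);
            }
        } finally {
            stop.set(true);
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // One thread is stuck, one is free: the short job still runs.
        System.out.println(shortJobCompletes(2, 1, 500));
        // Both threads are stuck: the short job stays queued.
        System.out.println(shortJobCompletes(2, 2, 500));
    }
}
```

In this sketch the short job only starves once every pool thread is occupied, which would be consistent with the observation earlier in the thread that a fresh server runs fine for a while before async tasks stop executing.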

Cheers,
Thorben


#26

Thorben,

Thank you for the explanation. This helps to understand what is going on.


#27

Hey @thorben I am circling back to this, and have some follow-ups:

You mentioned here:

  1. That you can disable deploymentAware or register manually. If this is the case, when a shared engine is being used, is it typical to have deploymentAware disabled as a default?

  2. What is the reasoning for deploymentAware being enabled in the default Camunda configuration?

  3. How are Camunda server restarts handled under the default configuration? Based on the docs and your comments above, the default settings imply that if you have an async process and you restart your server, you must manually register the deployment? If so, referring back to question 2 above, why would this be the default behaviour of the engine?

Thanks!


#28

Hi Stephen,

  1. If your processes depend on process application resources, you should not disable deploymentAware. If your process deployments are self-contained (e.g. only script tasks), it should be fine to do so.
  2. This follows from the first answer. Developing process applications is the most frequent use case. Activating deploymentAware in the distributions by default avoids seeing class loading exceptions when using process applications. Especially as a getting started experience this would be awful.
  3. If you have process applications, the engine recognizes the process applications automatically and makes registrations. If you don’t have process applications (or undeployed a process application but still want to execute jobs), then you must do manual registration in a deploymentAware setting.

Cheers,
Thorben


#29

@thorben what are your thoughts on making this setting configurable through an environment variable? If the variable is not provided, it would default to the current behaviour; otherwise you could supply a value for it (at least in the Docker container).


#30

Feel free to propose this for the docker container. I don’t really see the necessity to have this as a general engine/platform feature right now.


#31

Fair enough :)

@thorben, would it be valid to say that anyone who uses the engine in a shared engine environment and deploys through the REST API should have deploymentAware set to false?


#32

Yes, I think that’s a good rule.


#33

@thorben I do not believe this “rule” is documented.

This seems to me to be an important “rule” to be aware of for REST deployments. https://docs.camunda.org/manual/7.6/user-guide/process-engine/the-job-executor/#job-execution-in-heterogeneous-clusters could have this note added to it.

But where else would you suggest this be added?


#34

I think https://docs.camunda.org/manual/7.6/user-guide/process-engine/the-job-executor/#cluster-setups should differentiate between clustering with self-contained deployments and clustering with deployments that rely on environment resources (e.g. process applications, application server libraries, etc.). Then https://docs.camunda.org/manual/7.6/user-guide/process-engine/the-job-executor/#job-execution-in-heterogeneous-clusters should make clear that jobExecutorDeploymentAware is only recommended for the second case, i.e. deployments that rely on environment resources.

People who struggle with REST API deployments and job execution will probably not look into a section on clustering for the solution to their problem. They will rather look into job execution documentation. Thus, it may be a good idea to document the jobExecutorDeploymentAware flag in the job execution section. There, we could simply document the behavior of this flag independent of cluster setups. The clustering section could then link there and refer to this description.


#35

I will put something together and do a PR for review


#36

Hi Thorben,

I'm still a little puzzled by some of the terminology in this context and thus by the expected behaviour. Let me give a few definitions so I can clarify some points. The context is Tomcat-based, shared engine, not down to column-based multi-tenancy yet.

A node is a Tomcat Instance
A cluster is a set of nodes.
A logical engine is defined at the bpm-platform.xml level, thus a node can host many engine instances.
An engine instance uses a common database service.
A logical engine uses a defined DB schema within the DB service.

Hence if I deploy a BPMN resource via the REST API, I understand it will be stored in the DB repository associated with an engine. If the engine is configured to be deploymentAware, then an async continuation will not run, as the deployment is not registered with the engine.

Hence for deploymentAware to work, does that mean that as each node starts up, each engine instance builds an in-memory map of the deployments associated with the node, based on, say, examining the process application’s WAR file? What I am struggling with is whether deployments are registered at the node level (and thus by implication the engine instance level), at the logical engine level, or at the Tomcat node level.

(I can see the deployment IDs passed across in the job executor's select jobs SQL, I just don't follow where they are coming from…)

regards

Rob


#37

Hi Rob,

Deployment registrations are managed per logical engine in your terms, or per ProcessEngine Java object. Registrations are not persisted, to keep any topology data about engine or cluster setup out of the database. The job executor registrations boil down to a set in ProcessEngineConfigurationImpl, see

In addition, the registration of deployments with process applications is managed in ProcessApplicationManager, which is held by ProcessEngineConfigurationImpl, so this is also managed per logical engine.
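As a rough model of the behaviour described above, the sketch below keeps an in-memory set of registered deployment IDs per engine object and filters acquirable jobs against it. This is not Camunda code: Job, Engine, registerDeployment and acquirableJobIds are hypothetical names, and the real acquisition query is more involved. The sketch only assumes what this thread states, namely that registrations live in memory per engine, are not persisted, and that a deployment-aware engine only picks up jobs whose deployment is registered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class DeploymentAwareSketch {

    /** Hypothetical stand-in for a job row; only the fields needed here. */
    static final class Job {
        final String id;
        final String deploymentId;

        Job(String id, String deploymentId) {
            this.id = id;
            this.deploymentId = deploymentId;
        }
    }

    /** Hypothetical stand-in for the in-memory state of one logical engine. */
    static final class Engine {
        final boolean deploymentAware;
        // Held in memory only, so a newly started engine begins with an empty set.
        final Set<String> registeredDeployments = ConcurrentHashMap.newKeySet();

        Engine(boolean deploymentAware) {
            this.deploymentAware = deploymentAware;
        }

        void registerDeployment(String deploymentId) {
            registeredDeployments.add(deploymentId);
        }

        /** Returns the IDs of the jobs this engine would be willing to acquire. */
        List<String> acquirableJobIds(List<Job> jobTable) {
            List<String> ids = new ArrayList<>();
            for (Job job : jobTable) {
                if (!deploymentAware || registeredDeployments.contains(job.deploymentId)) {
                    ids.add(job.id);
                }
            }
            return ids;
        }
    }

    public static void main(String[] args) {
        // The job table outlives any engine object, like rows in the database.
        List<Job> jobTable = Arrays.asList(new Job("job-1", "dep-A"), new Job("job-2", "dep-B"));

        Engine afterRestart = new Engine(true);
        System.out.println(afterRestart.acquirableJobIds(jobTable)); // nothing registered yet

        afterRestart.registerDeployment("dep-A");                    // manual registration
        System.out.println(afterRestart.acquirableJobIds(jobTable));

        Engine notAware = new Engine(false);
        System.out.println(notAware.acquirableJobIds(jobTable));     // no filtering at all
    }
}
```

Creating a new Engine object plays the role of a server restart here: the job rows are still present, but the registration set is empty until something registers the deployment again.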

Cheers,
Thorben


#38

@thorben are there any specifics on why REST deployments are not re-registered on server restart?


#39

I suspect the reason is that a server cannot know whether additional required resources have been deployed to the node or not. With a process application in a WAR file, the container can inform the server that the resources are there.

Perhaps an attribute could be added to the deployment entity, indicating whether the deployment needs to be deployment aware or not. For those that do not (i.e. no additional external resources), the engine could build a list of such deployments on startup and register them with the server node. The downside is that if a new deployment is made, how are all servers notified of it (introduce polling)? In addition, more thought is required as to the intrinsic race condition/overlap of process application driven deployment versus API driven deployment…

Having a deployment aware but deployment method agnostic cluster would be a nice feature to have…

regards

Rob


#40

Hi!

I had a similar problem.
After deploying a BPMN model over the REST API, async jobs ran perfectly.
After restarting the Camunda platform Docker container, no jobs were executed.

In the default camunda-bpm-platform Docker configuration (bpm-platform.xml), the jobExecutorDeploymentAware value is true.
My solution was simply to remove the jobExecutorDeploymentAware setting from the configuration file.

Erki