Service Task with Async Continuation Never Executes

Thank you for your reply. I have previously read the documentation on this topic. We don’t really need true parallel behavior. In this case, we have a user task with our own custom form and UI that displays different status based on the status of tasks in the other path in the parallel gateway. We needed to make the service task async, so that the user task properly displays the status information.

But none of this really explains to me why the async service task is not executing at all. It seems to me that this is a bug in the workflow engine that is causing this task to never execute.

Hello Chris,

I was wondering if you have had a chance to consider my previous reply, in which I indicated that sending the POST to execute the job did cause the stuck task to be executed.

Does this give you any further clue as to what might be causing it to get stuck?
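
For readers following along, the POST mentioned above can be sketched as below. This is a minimal sketch, not the exact call used in this thread: it assumes the REST API is reachable at `http://localhost:8080/engine-rest` (an assumption) and uses a placeholder job id. It targets the REST API's `POST /job/{id}/execute` endpoint and only builds the request, it does not send it.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class ExecuteJobRequest {
    // Assumed base URL of the engine's REST API; adjust to your installation.
    static final String BASE_URL = "http://localhost:8080/engine-rest";

    /** Builds (but does not send) the request that asks the engine to execute one job. */
    static HttpRequest build(String jobId) {
        return HttpRequest.newBuilder()
                .uri(URI.create(BASE_URL + "/job/" + jobId + "/execute"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build("aJobId"); // placeholder id, not a real job
        System.out.println(request.method() + " " + request.uri());
        // To actually send it (needs java.net.http.HttpClient and HttpResponse):
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.discarding());
    }
}
```

The job id would be the `ID_` value of the stuck job, for example as returned by a job query for the affected process instance.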

Hey Stephen,

Do you use an embedded or a shared engine? Do you know whether your Job Executor is running?
How many process instances do you have from this deployment? Do you observe that the jobs are executed after some time, such as 1 or 2 minutes?

Best regards,
Chris

Also make sure the deployment that the jobs belong to is registered with the job executor, see the docs here for an explanation and relevant API: https://docs.camunda.org/manual/latest/user-guide/process-engine/the-job-executor/#job-execution-in-heterogeneous-clusters

Although the docs deal with heterogeneous clusters, the setting jobExecutorDeploymentAware is true by default.

We are using a shared engine.
I don’t know if the Job Executor is running. Can you tell me how I would check this?
From the deployment that is failing, there are 11 running instances.
No, the jobs are not executed after some time. They are stuck forever.

As I previously mentioned, the only way to get subsequent jobs to run is to make a new deployment of the workflow model. Then new instances run ok, but the old ones are still stuck forever.

Aha, this gives me a clue as to what might be happening.

We are running our application, including the Camunda engine, using virtual machines on Amazon Web Services. In our QA environment, we often bring down deployed instances and then bring up new instances whenever we deploy new code.

Also, we haven’t touched the jobExecutorDeploymentAware setting, so this is true by default (as you noted).

So, could it be possible that the following is happening?

  1. We have a virtual machine up in AWS.
  2. We deploy a workflow model to this machine.
  3. Instances of this model run fine.
  4. We bring down this machine and bring up a new one.
  5. Now when we try to run instances of this same deployment, the async service tasks don’t run because of the deployment aware setting. This is a different machine, so these jobs won’t run on it because they are associated with the deployment on the machine that has been brought down?
  6. We observe that if we deploy the exact same model, then instances from the new model run fine, because now the deployment is associated with this new virtual machine.

Does this seem to you to be the source of our problem? If so, then I guess we need to set jobExecutorDeploymentAware to false.

Can you tell me exactly how nodes are identified in a deployment aware configuration? Is it based on the IP address of the machine or is some other id used?

How do you deploy processes? As part of a process application or directly via REST/Java API?

This is not a centralized configuration. Instead, each engine (aka node, if you have one per node) keeps a set of deployment IDs (i.e. the database ID_ fields in ACT_RE_DEPLOYMENT) of deployments that are registered with it and uses that while acquiring jobs. There is no concept of node/machine/engine identifiers.
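
To make the description above concrete, here is a toy model of deployment-aware acquisition. It is a simplified sketch and not Camunda's actual implementation: the `Engine` and `Job` classes and the ids are made up for illustration. The point it shows is that the set of registered deployment IDs lives in each engine's memory, so a freshly started engine begins with an empty set and skips jobs of older deployments until they are registered.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class DeploymentAwareDemo {
    /** A job waiting in the shared database; only the deployment id matters here. */
    static final class Job {
        final String id;
        final String deploymentId;

        Job(String id, String deploymentId) {
            this.id = id;
            this.deploymentId = deploymentId;
        }
    }

    /** Toy engine that remembers, in memory only, which deployments are registered with it. */
    static final class Engine {
        private final Set<String> registeredDeployments = new HashSet<>();
        private final boolean deploymentAware;

        Engine(boolean deploymentAware) {
            this.deploymentAware = deploymentAware;
        }

        void register(String deploymentId) {
            registeredDeployments.add(deploymentId);
        }

        /** Returns the jobs this engine would be willing to pick up. */
        List<Job> acquire(List<Job> jobsInDatabase) {
            List<Job> acquired = new ArrayList<>();
            for (Job job : jobsInDatabase) {
                if (!deploymentAware || registeredDeployments.contains(job.deploymentId)) {
                    acquired.add(job);
                }
            }
            return acquired;
        }
    }

    public static void main(String[] args) {
        List<Job> database = List.of(new Job("job-1", "deployment-A"));

        Engine freshNode = new Engine(true);                    // new machine, nothing registered yet
        System.out.println(freshNode.acquire(database).size()); // 0: the job is ignored

        freshNode.register("deployment-A");                     // manual registration
        System.out.println(freshNode.acquire(database).size()); // 1: the job is picked up
    }
}
```

With `deploymentAware` set to `false`, the toy engine acquires every job regardless of registrations, which corresponds to the other option discussed in this thread.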

Cheers,
Thorben

We deploy directly via the REST API. We are not using a process application.

Ok, then either deactivate the deploymentAware setting or register the deployments manually.

For process applications, the process engine has logic that detects previous versions of a deployment and makes the respective registrations. For standalone deployments, that is not the case.
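
A rough sketch of what manual registration at engine startup could look like is shown below. The `JobExecutorRegistry` interface is a hypothetical stand-in, declared only so the example compiles without the Camunda library; its two methods mirror `ManagementService#registerDeploymentForJobExecutor(String)` and `ManagementService#getRegisteredDeployments()` from the Java API. How the list of deployment IDs is obtained (for example through a deployment query) is left out and is an assumption of this sketch.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class RegisterDeploymentsAtStartup {
    /**
     * Hypothetical stand-in so the sketch compiles without the Camunda library.
     * The methods mirror ManagementService#registerDeploymentForJobExecutor
     * and ManagementService#getRegisteredDeployments.
     */
    interface JobExecutorRegistry {
        void registerDeploymentForJobExecutor(String deploymentId);

        Set<String> getRegisteredDeployments();
    }

    /** Registers every deployment id that is not registered yet; returns how many were added. */
    static int registerMissing(JobExecutorRegistry registry, List<String> deploymentIds) {
        Set<String> known = new HashSet<>(registry.getRegisteredDeployments());
        int added = 0;
        for (String id : deploymentIds) {
            if (known.add(id)) { // true only the first time an id is seen
                registry.registerDeploymentForJobExecutor(id);
                added++;
            }
        }
        return added;
    }

    /** In-memory implementation used only to exercise the sketch. */
    static final class InMemoryRegistry implements JobExecutorRegistry {
        private final Set<String> ids = new HashSet<>();

        @Override
        public void registerDeploymentForJobExecutor(String deploymentId) {
            ids.add(deploymentId);
        }

        @Override
        public Set<String> getRegisteredDeployments() {
            return new HashSet<>(ids);
        }
    }

    public static void main(String[] args) {
        InMemoryRegistry registry = new InMemoryRegistry();
        // Placeholder ids; in practice these would be ID_ values from ACT_RE_DEPLOYMENT.
        int added = registerMissing(registry, List.of("deployment-A", "deployment-B"));
        System.out.println("newly registered: " + added);
    }
}
```

Because registrations are kept per engine and not in a central place, a step like this would have to run on every engine start, not once per deployment.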

Ok, I am going to try deactivating the deploymentAware setting. Thank you very much for your help.

@thorben
Have you experienced this using the docker container as a dev environment?
We have experienced the behaviour described above, essentially:

But all inside the default-settings docker container that is provided at GitHub (camunda/docker-camunda-bpm-platform: Docker images for the camunda BPM platform).

Looks like the job executor has problems when you start and stop the container.

Anyone else experience this?

Confirmed. If you run the docker container with elements that are async and then restart the container, the process engine logic that detects previous versions of a deployment does not seem to function.

These are processes deployed through the API to the process engine.

@camunda

Anyone from Camunda able to advise?

Thanks guys!

Hi @StephenOTT

the docker containers use the default Camunda distribution settings, so the job executor is configured to be deployment aware.

If you deploy processes through the REST API you cannot use this setting as already mentioned by @thorben:

This is not related to the docker images.

Or if I misunderstood your setup or deployment procedure please tell me how I can reproduce this behavior.

Cheers,
Sebastian

@menski I do not think this is related “specifically” to the docker image, but I guess it is possible.

Based on @thorben’s comments above, and the documentation about DeploymentAware (which is on by default).

The steps to reproduce:

  1. Deploy default docker container setup.
  2. Deploy a BPMN with a few automated tasks such as a script task. Make the Start event, and all tasks async.
  3. Run the BPMN.
  4. Stop the Container
  5. Start the Container
  6. Run the BPMN

On the second run of the BPMN, the process gets stuck on the start event and does not move forward; the job executor no longer runs jobs for that process deployment. If you redeploy the BPMN, the problem is resolved. It looks like the DeploymentAware registrations, i.e. the details that @thorben describes here:

are not occurring.

How do you deploy the process? As process application (war), by Java Code or with the REST API?

Bad news - this problem is occurring again.

Back in August, as suggested by thorben, we deactivated the deploymentAware setting. The current setting in our bpm-platform.xml is: <property name="jobExecutorDeploymentAware">false</property>

For several months, we have not had any problems. However just recently the same problem we originally reported is occurring again. Previously, if we deployed a new version of the model that was getting hung, subsequent instances would run fine. Now however, deploying a new version does not clear up the problem.

So far, we cannot find a pattern to the problem. When we bring up a new server, it seems that things run fine for a while, but then suddenly the workflow model in question will no longer execute the async service task. Once we get in this state, it seems the only way to resolve it is to tear down the server and bring up a new one.

To demonstrate the problem I have created a modified version of the Camunda http-connector example. I have uploaded this model. invokeRestService.src.bpmn (11.7 KB)

Also, here is the ACT_RU_JOB table entry for one of the stuck instances:

# ID_, REV_, TYPE_, LOCK_EXP_TIME_, LOCK_OWNER_, EXCLUSIVE_, EXECUTION_ID_, PROCESS_INSTANCE_ID_, PROCESS_DEF_ID_, PROCESS_DEF_KEY_, RETRIES_, EXCEPTION_STACK_ID_, EXCEPTION_MSG_, DUEDATE_, REPEAT_, HANDLER_TYPE_, HANDLER_CFG_, DEPLOYMENT_ID_, SUSPENSION_STATE_, JOB_DEF_ID_, PRIORITY_, SEQUENCE_COUNTER_, TENANT_ID_
'cef61540-be55-11e6-ba32-0242ac120003', '1', 'message', NULL, NULL, '0', 'cef5ee2e-be55-11e6-ba32-0242ac120003', 'cef5c717-be55-11e6-ba32-0242ac120003', '559c8fbd-be55-11e6-b2b3-0242ac120003', '_c2ab1f62-ecd4-4401-96c2-0f67552b1b2a', '3', NULL, NULL, NULL, NULL, 'async-continuation', 'transition-create-scope', '558fe58b-be55-11e6-b2b3-0242ac120003', '1', '559c8fbe-be55-11e6-b2b3-0242ac120003', '0', '1', '7d48ec6a-2144-4535-b54c-2c23e703f3e1'
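
As a reading aid for the row above, the sketch below applies the kind of checks a job has to pass before it can be acquired: retries left, due, not locked, and not suspended. This is a simplified approximation written for this thread and not the engine's actual acquisition query; the `JobRow` class is hypothetical and carries only the columns needed here.

```java
import java.time.Instant;

final class JobRowCheck {
    /** Hypothetical holder for the ACT_RU_JOB columns that matter for acquisition. */
    static final class JobRow {
        final int retries;            // RETRIES_
        final Instant dueDate;        // DUEDATE_, may be null
        final String lockOwner;       // LOCK_OWNER_, may be null
        final Instant lockExpiration; // LOCK_EXP_TIME_, may be null
        final int suspensionState;    // SUSPENSION_STATE_, 1 means active

        JobRow(int retries, Instant dueDate, String lockOwner,
               Instant lockExpiration, int suspensionState) {
            this.retries = retries;
            this.dueDate = dueDate;
            this.lockOwner = lockOwner;
            this.lockExpiration = lockExpiration;
            this.suspensionState = suspensionState;
        }
    }

    /** Simplified check: would this row be a candidate for acquisition at time 'now'? */
    static boolean looksAcquirable(JobRow row, Instant now) {
        boolean hasRetries = row.retries > 0;
        boolean isDue = row.dueDate == null || !row.dueDate.isAfter(now);
        boolean isUnlocked = row.lockOwner == null
                || (row.lockExpiration != null && row.lockExpiration.isBefore(now));
        boolean isActive = row.suspensionState == 1;
        return hasRetries && isDue && isUnlocked && isActive;
    }

    public static void main(String[] args) {
        // Values taken from the row above: RETRIES_ 3, no due date, no lock, SUSPENSION_STATE_ 1.
        JobRow stuck = new JobRow(3, null, null, null, 1);
        System.out.println(looksAcquirable(stuck, Instant.now())); // prints true
    }
}
```

By these checks the stuck job looks ready to be picked up, which would suggest that the job row itself is fine and that nothing is acquiring it.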

Here is a screen shot of the cockpit showing instances stuck on the service task:

Note one item I did not previously mention: we are deploying the docker versions of the Camunda engine, but with our modifications to bpm-platform.xml. Otherwise, our environment is the same as reported previously in this thread.

Please let me know what additional information I can provide to help solve this issue. Any suggestions for other things we should investigate would be helpful.

We believe we have found the cause of this most recent occurrence of this issue.

We discovered that a workflow had been deployed with a problem that seemed to cause the issue. This workflow contained an async script (JavaScript) task that contained an infinite loop. When an instance of this workflow was running, we saw a significant spike in CPU usage for the Camunda Java application. It seems that this script task was constantly running. We think this may have prevented any other async jobs from running. Once we deleted all running instances of this workflow, everything returned to normal: CPU usage returned to normal and other workflows with async tasks ran as expected.
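
The effect we suspect can be reproduced in miniature with a plain Java thread pool. This is a toy model and not Camunda's job executor: it uses a single worker thread, and the stuck job waits on a latch (standing in for the infinite loop) so the demo stays cheap. It shows that a second job does not run as long as the only worker is occupied.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class StarvedWorkerDemo {
    /**
     * Submits one job that never finishes on its own plus one ordinary job to a
     * single-worker pool. Returns {ranWhileBlocked, ranAfterRelease} for the ordinary job.
     */
    static boolean[] run() {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        CountDownLatch stuckStarted = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        CountDownLatch otherJobRan = new CountDownLatch(1);
        try {
            pool.submit(() -> {
                stuckStarted.countDown();
                try {
                    release.await(); // stands in for the script that never returns
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            pool.submit(otherJobRan::countDown); // the ordinary async job

            stuckStarted.await();
            boolean ranWhileBlocked = otherJobRan.await(200, TimeUnit.MILLISECONDS);
            release.countDown(); // comparable to deleting the misbehaving instances
            boolean ranAfterRelease = otherJobRan.await(5, TimeUnit.SECONDS);
            return new boolean[] { ranWhileBlocked, ranAfterRelease };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        boolean[] result = run();
        System.out.println("ran while worker was stuck: " + result[0]);
        System.out.println("ran after release: " + result[1]);
    }
}
```

A real job executor has more than one worker thread, so it would take several such stuck jobs, or retries of the same one, to occupy all of them; that part is our assumption about what happened.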

As of now, we are not experiencing the originally reported issue.

Is there such a thing as a maximum execution time limit for scripts, tasks, etc. in Camunda?

Note that we tried putting a BPMN timer event on the script task, such that it cancelled the task after a minute and then continued on to end the workflow. We observed that while the script task was running there was a significant increase in CPU utilization for the Java process (up to 200%). But even after the task was cancelled and the process instance terminated, the CPU utilization never returned to a reasonable value. I have no idea why that would be.
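
One possible explanation, which we have not confirmed against Camunda's internals, is that cancelling work from the outside does not stop a thread that is already spinning in a loop. The plain Java sketch below illustrates that general behaviour: `Future.cancel(true)` marks the task as cancelled and interrupts the thread, but a loop that never checks the interrupt flag keeps running. The `stop` flag exists only so the demo can shut itself down.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

final class CancelDoesNotStopLoopDemo {
    /** Returns true if the loop was still making progress after its Future was cancelled. */
    static boolean loopSurvivesCancel() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        AtomicBoolean stop = new AtomicBoolean(false); // demo-only escape hatch
        AtomicLong iterations = new AtomicLong();

        Future<?> future = pool.submit(() -> {
            // Busy loop that never checks the interrupt flag, like a script stuck in a loop.
            while (!stop.get()) {
                iterations.incrementAndGet();
            }
        });

        try {
            waitUntilGreater(iterations, 0);  // make sure the loop has started
            future.cancel(true);              // cancel from the outside; only sets the interrupt flag
            long atCancel = iterations.get();
            return future.isCancelled() && waitUntilGreater(iterations, atCancel);
        } finally {
            stop.set(true);                   // let the worker thread finish
            pool.shutdown();
        }
    }

    /** Polls for up to five seconds until the counter exceeds the given value. */
    private static boolean waitUntilGreater(AtomicLong counter, long value) {
        long deadline = System.nanoTime() + 5_000_000_000L;
        while (System.nanoTime() < deadline) {
            if (counter.get() > value) {
                return true;
            }
            Thread.onSpinWait();
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("loop still running after cancel: " + loopSurvivesCancel());
    }
}
```

If something similar applies here, the process instance can be gone from the engine's point of view while the thread that was executing the script is still busy, which would match the CPU usage we observed.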