Service Task with Async Continuation Never Executes


#1

We have a workflow model whose running instances sometimes get stuck on a service task. The service task is configured with Asynchronous Before. This seems to happen randomly, but once one instance of the model gets stuck in this way, all instances of this model that are started afterwards also get stuck at the same service task. Here’s the really interesting behavior: if we make a new deployment of this model (with no changes to the model at all), instances of the new deployment run successfully. However, the instances of the previous deployment that were already stuck in the service task remain stuck there.

Our environment is as follows:
Camunda 7.5.0
Running in the Apache Tomcat container
Using the standalone engine configuration
Using a mysql database running on Amazon RDS

The relevant portion of the model is shown below. The service task named “Redact Uploaded Documents” is the task that gets stuck. This task is an http-connector that issues a POST to one of our own application endpoints. We have extensive logging in our application and the logging indicates that our endpoint is not getting called. We have also examined the catalina log and find no errors there. So it appears that this service task is not getting executed at all.

A Cockpit screenshot is shown below, showing an instance stuck in this state:

I have tried to track down this problem by executing various REST API queries to check on the activities, jobs, etc. Those queries are shown below:

engine-rest/execution?processInstanceId=df7061f3-5824-11e6-8ecb-0242ac120003

    [
      {
        "id": "df7061f3-5824-11e6-8ecb-0242ac120003",
        "processInstanceId": "df7061f3-5824-11e6-8ecb-0242ac120003",
        "ended": false,
        "tenantId": null
      },
      {
        "id": "df70890c-5824-11e6-8ecb-0242ac120003",
        "processInstanceId": "df7061f3-5824-11e6-8ecb-0242ac120003",
        "ended": false,
        "tenantId": null
      },
      {
        "id": "df70890d-5824-11e6-8ecb-0242ac120003",
        "processInstanceId": "df7061f3-5824-11e6-8ecb-0242ac120003",
        "ended": false,
        "tenantId": null
      },
      {
        "id": "df70890e-5824-11e6-8ecb-0242ac120003",
        "processInstanceId": "df7061f3-5824-11e6-8ecb-0242ac120003",
        "ended": false,
        "tenantId": null
      }
    ]

engine-rest/process-instance/df7061f3-5824-11e6-8ecb-0242ac120003/activity-instances

{
  "id": "df7061f3-5824-11e6-8ecb-0242ac120003",
  "parentActivityInstanceId": null,
  "activityId": "RedactionDemo:39:4e60cb5c-52a0-11e6-bd44-0242ac120003",
  "activityType": "processDefinition",
  "processInstanceId": "df7061f3-5824-11e6-8ecb-0242ac120003",
  "processDefinitionId": "RedactionDemo:39:4e60cb5c-52a0-11e6-bd44-0242ac120003",
  "childActivityInstances": [
    {
      "id": "UserTask_0fg7rbr:df70b01f-5824-11e6-8ecb-0242ac120003",
      "parentActivityInstanceId": "df7061f3-5824-11e6-8ecb-0242ac120003",
      "activityId": "UserTask_0fg7rbr",
      "activityType": "userTask",
      "processInstanceId": "df7061f3-5824-11e6-8ecb-0242ac120003",
      "processDefinitionId": "RedactionDemo:39:4e60cb5c-52a0-11e6-bd44-0242ac120003",
      "childActivityInstances": [],
      "childTransitionInstances": [],
      "executionIds": [
        "df70890e-5824-11e6-8ecb-0242ac120003"
      ],
      "activityName": "Wait Screen",
      "name": "Wait Screen"
    }
  ],
  "childTransitionInstances": [
    {
      "id": "df70890d-5824-11e6-8ecb-0242ac120003",
      "parentActivityInstanceId": "df7061f3-5824-11e6-8ecb-0242ac120003",
      "processInstanceId": "df7061f3-5824-11e6-8ecb-0242ac120003",
      "processDefinitionId": "RedactionDemo:39:4e60cb5c-52a0-11e6-bd44-0242ac120003",
      "activityId": "ServiceTask_0xs19ad",
      "activityName": "Redact Uploaded Documents",
      "activityType": "serviceTask",
      "executionId": "df70890d-5824-11e6-8ecb-0242ac120003",
      "targetActivityId": "ServiceTask_0xs19ad"
    }
  ],
  "executionIds": [
    "df7061f3-5824-11e6-8ecb-0242ac120003",
    "df70890c-5824-11e6-8ecb-0242ac120003"
  ],
  "activityName": "Redaction Demo",
  "name": "Redaction Demo"
}
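
For readers following along: as far as I understand the activity-instance tree, a transition instance is a token that is about to enter (or has just left) an activity, which is what an Asynchronous Before continuation looks like while its job is still waiting. A minimal sketch for pulling those out of the response, using a trimmed copy of the JSON above (the function name is made up for illustration):

```python
def find_transition_instances(activity_instance):
    """Collect all transition instances in an activity-instance tree, depth first."""
    found = list(activity_instance.get("childTransitionInstances", []))
    for child in activity_instance.get("childActivityInstances", []):
        found.extend(find_transition_instances(child))
    return found


# Trimmed copy of the activity-instances response shown above.
tree = {
    "id": "df7061f3-5824-11e6-8ecb-0242ac120003",
    "activityType": "processDefinition",
    "childActivityInstances": [
        {
            "id": "UserTask_0fg7rbr:df70b01f-5824-11e6-8ecb-0242ac120003",
            "activityId": "UserTask_0fg7rbr",
            "activityType": "userTask",
            "childActivityInstances": [],
            "childTransitionInstances": [],
        }
    ],
    "childTransitionInstances": [
        {
            "id": "df70890d-5824-11e6-8ecb-0242ac120003",
            "activityId": "ServiceTask_0xs19ad",
            "activityName": "Redact Uploaded Documents",
            "executionId": "df70890d-5824-11e6-8ecb-0242ac120003",
        }
    ],
}

# Pairs of (activity waiting to be entered, execution holding the token).
waiting = [(t["activityId"], t["executionId"]) for t in find_transition_instances(tree)]
print(waiting)
```

The execution id it reports is the same one that shows up as executionId in the job query further down.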

engine-rest/process-instance/df7061f3-5824-11e6-8ecb-0242ac120003

{
  "links": [],
  "id": "df7061f3-5824-11e6-8ecb-0242ac120003",
  "definitionId": "RedactionDemo:39:4e60cb5c-52a0-11e6-bd44-0242ac120003",
  "businessKey": null,
  "caseInstanceId": null,
  "ended": false,
  "suspended": false,
  "tenantId": null
}

engine-rest/job-definition?processDefinitionId=RedactionDemo:39:4e60cb5c-52a0-11e6-bd44-0242ac120003

[
  {
    "id": "4e60cb5d-52a0-11e6-bd44-0242ac120003",
    "processDefinitionId": "RedactionDemo:39:4e60cb5c-52a0-11e6-bd44-0242ac120003",
    "processDefinitionKey": "RedactionDemo",
    "jobType": "async-continuation",
    "jobConfiguration": "async-before",
    "activityId": "ServiceTask_0xs19ad",
    "suspended": false,
    "overridingJobPriority": null,
    "tenantId": null
  }
]

engine-rest/job?processInstanceId=df7061f3-5824-11e6-8ecb-0242ac120003

[
  {
    "id": "df7ac245-5824-11e6-8ecb-0242ac120003",
    "jobDefinitionId": "4e60cb5d-52a0-11e6-bd44-0242ac120003",
    "processInstanceId": "df7061f3-5824-11e6-8ecb-0242ac120003",
    "processDefinitionId": "RedactionDemo:39:4e60cb5c-52a0-11e6-bd44-0242ac120003",
    "processDefinitionKey": "RedactionDemo",
    "executionId": "df70890d-5824-11e6-8ecb-0242ac120003",
    "exceptionMessage": null,
    "retries": 3,
    "dueDate": null,
    "suspended": false,
    "priority": 0,
    "tenantId": null
  }
]
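
A rough sanity check on that job, written as a sketch rather than as the engine's actual acquisition logic: my assumption is that a job is normally picked up when it has retries left, is not suspended, and has no due date in the future. The function name is hypothetical, and the due date parsing is deliberately simplistic.

```python
from datetime import datetime


def looks_due(job, now=None):
    """Rough check: does this job look like one a job executor should pick up?"""
    now = now or datetime.now()
    due = job.get("dueDate")
    if due is not None:
        # Simplification: only the first 19 characters (YYYY-MM-DDTHH:MM:SS)
        # are parsed, so any time zone suffix is ignored.
        if datetime.strptime(due[:19], "%Y-%m-%dT%H:%M:%S") > now:
            return False
    return job["retries"] > 0 and not job["suspended"]


# The relevant fields of the job returned by the query above.
stuck_job = {
    "id": "df7ac245-5824-11e6-8ecb-0242ac120003",
    "retries": 3,
    "dueDate": None,
    "suspended": False,
}

print(looks_due(stuck_job))
```

By these criteria the job looks perfectly executable, so whatever keeps it from running is not visible in these fields.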

engine-rest/execution/df70890d-5824-11e6-8ecb-0242ac120003

{
  "id": "df70890d-5824-11e6-8ecb-0242ac120003",
  "processInstanceId": "df7061f3-5824-11e6-8ecb-0242ac120003",
  "ended": false,
  "tenantId": null
}

Here is the ACT_RU_JOB db table row for the relevant job:

ID_, REV_, TYPE_, LOCK_EXP_TIME_, LOCK_OWNER_, EXCLUSIVE_, EXECUTION_ID_, PROCESS_INSTANCE_ID_, PROCESS_DEF_ID_, PROCESS_DEF_KEY_, RETRIES_, EXCEPTION_STACK_ID_, EXCEPTION_MSG_, DUEDATE_, REPEAT_, HANDLER_TYPE_, HANDLER_CFG_, DEPLOYMENT_ID_, SUSPENSION_STATE_, JOB_DEF_ID_, PRIORITY_, SEQUENCE_COUNTER_, TENANT_ID_

'df7ac245-5824-11e6-8ecb-0242ac120003', '1', 'message', NULL, NULL, '0', 'df70890d-5824-11e6-8ecb-0242ac120003', 'df7061f3-5824-11e6-8ecb-0242ac120003', 'RedactionDemo:39:4e60cb5c-52a0-11e6-bd44-0242ac120003', 'RedactionDemo', '3', NULL, NULL, NULL, NULL, 'async-continuation', 'transition-create-scope', '4e54483a-52a0-11e6-bd44-0242ac120003', '1', '4e60cb5d-52a0-11e6-bd44-0242ac120003', '0', '1', NULL

Here is the ACT_HI_JOB_LOG entry:

ID_, TIMESTAMP_, JOB_ID_, JOB_DUEDATE_, JOB_RETRIES_, JOB_PRIORITY_, JOB_EXCEPTION_MSG_, JOB_EXCEPTION_STACK_ID_, JOB_STATE_, JOB_DEF_ID_, JOB_DEF_TYPE_, JOB_DEF_CONFIGURATION_, ACT_ID_, EXECUTION_ID_, PROCESS_INSTANCE_ID_, PROCESS_DEF_ID_, PROCESS_DEF_KEY_, DEPLOYMENT_ID_, SEQUENCE_COUNTER_, TENANT_ID_
'df7b3776-5824-11e6-8ecb-0242ac120003', '2016-08-01 20:16:47', 'df7ac245-5824-11e6-8ecb-0242ac120003', NULL, '3', '0', NULL, NULL, '0', '4e60cb5d-52a0-11e6-bd44-0242ac120003', 'async-continuation', 'async-before', 'ServiceTask_0xs19ad', 'df70890d-5824-11e6-8ecb-0242ac120003', 'df7061f3-5824-11e6-8ecb-0242ac120003', 'RedactionDemo:39:4e60cb5c-52a0-11e6-bd44-0242ac120003', 'RedactionDemo', '4e54483a-52a0-11e6-bd44-0242ac120003', '1', NULL

I don’t see anything unusual in any of the data posted above. But maybe someone can see something that would explain why this process instance is stuck in the service task.

At this point, we are stuck and have no clue as to what could be causing this. Any help would be greatly appreciated. If additional data is required, please let me know.


#2

Have you tried to execute the job?

POST request on /job/df7ac245-5824-11e6-8ecb-0242ac120003/execute
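
A minimal sketch of that call from Python, assuming the REST API is reachable at http://localhost:8080/engine-rest (adjust to your setup) and using the job id returned by the job query in post #1. The helper name is made up; the request is only built here, not sent.

```python
import urllib.request


def build_execute_job_request(base_url, job_id):
    """Build (but do not send) the POST that asks the engine to execute one job."""
    url = f"{base_url.rstrip('/')}/job/{job_id}/execute"
    # An empty body is enough; method="POST" makes the intent explicit.
    return urllib.request.Request(url, data=b"", method="POST")


BASE_URL = "http://localhost:8080/engine-rest"  # assumption: default local setup
JOB_ID = "df7ac245-5824-11e6-8ecb-0242ac120003"  # "id" from the job query in post #1

request = build_execute_job_request(BASE_URL, JOB_ID)
print(request.get_method(), request.full_url)

# To actually send it:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```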

Best regards,
Chris


#3

I didn’t review for errors… However, parallel gateways don’t always behave as expected due to database transaction behavior and/or requirements.

I did see the async-continuation. But, the database requirements will attempt to serialize as transaction requirements demand.

A simulated, of sorts, ‘parallel’ can be achieved with the async setting in Camunda. But, this is really transaction configuration. The documentation goes into very good detail on this topic.

I also ran into this issue.

For true parallel behavior I ended up looking into taking advantage of the app container’s executor service (on WildFly 10). Configure the BPMN model to illustrate process information but keep DB requirements light and attempt to offload to the executor service. The next problem though is collecting the results. The intermediate message event doesn’t (yet?) support persistent message-event subscription. But, there are work-arounds given that Camunda runs on current platforms. You could mix-and-match various frameworks to achieve requirements. Specifically, I’m looking at Camunda with: Apache-Camel + robust messaging.


#4

Thank you for the suggestion. I tried the suggested POST and it did cause the job to execute, thereby getting this instance out of the “stuck” state.

Does that tell you anything about why these workflows are getting stuck on this task? Surely in an operational system, we can’t be expected to somehow monitor running instances and issue this request if a workflow instance is stuck. I don’t see any reason why the engine is not executing this job. This feels like a bug in the workflow engine.


#5

Thank you for your reply. I have previously read the documentation on this topic. We don’t really need true parallel behavior. In this case, we have a user task with our own custom form and UI that displays different status based on the status of tasks in the other path in the parallel gateway. We needed to make the service task async, so that the user task properly displays the status information.

But none of this really explains to me why the async service task is not executing at all. It seems to me that this is a bug in the workflow engine that is causing this task to never execute.


#6

Hello Chris,

I was wondering if you have had a chance to consider my previous reply, in which I indicated that sending the POST to execute the job did cause the stuck task to be executed.

Does this give you any further clue as to what might be causing it to get stuck?


#7

Hey Stephen,

do you use an embedded or shared engine? Do you know that your Job Executor is running?
How many process instances do you have from this deployment? Do you observe that the jobs are executed after some time, like 1 or 2 minutes?

Best regards,
Chris


#8

Also make sure the deployment that the jobs belong to is registered with the job executor, see the docs here for an explanation and relevant API: https://docs.camunda.org/manual/latest/user-guide/process-engine/the-job-executor/#job-execution-in-heterogeneous-clusters

Although the docs deal with heterogeneous clusters, the setting jobExecutorDeploymentAware is true by default.


#9

We are using a shared engine.
I don’t know if the Job Executor is running. Can you tell me how I would check this?
From the deployment that is failing, there are 11 running instances.
No, the jobs are not executed after some time. They are stuck forever.

As I previously mentioned, the only way to get subsequent jobs to run is to make a new deployment of the workflow model. Then new instances run ok, but the old ones are still stuck forever.


#10

Aha, this gives me a clue as to what might be happening.

We are running our application, including the Camunda engine, using virtual machines on Amazon Web Services. In our QA environment, we often bring down deployed instances and then bring up new instances whenever we deploy new code.

Also, we haven’t touched the jobExecutorDeploymentAware setting, so this is true by default (as you noted).

So, could it be possible that the following is happening?

  1. We have a virtual machine up in AWS.
  2. We deploy a workflow model to this machine.
  3. Instances of this model run fine.
  4. We bring down this machine and bring up a new one.
  5. Now when we try to run instances of this same deployment, the async service tasks don’t run because of the deployment aware setting. This is a different machine, so these jobs won’t run on this machine because they are associated with the deployment on the machine which has been brought down?
  6. We observe that if we deploy the exact same model, then instances from the new model run fine, because now the deployment is associated with this new virtual machine.

Does it seem to you that this is the source of our problem? If so, then I guess we need to set jobExecutorDeploymentAware to false.

Can you tell me exactly how nodes are identified in a deployment aware configuration? Is it based on the IP address of the machine or is some other id used?


#11

How do you deploy processes? As part of a process application or directly via REST/Java API?

This is not a centralized configuration. Instead, each engine (aka node, if you have one per node) keeps a set of deployment IDs (i.e. the database ID_ fields in ACT_RE_DEPLOYMENT) of deployments that are registered with it and uses that while acquiring jobs. There is no concept of node/machine/engine identifiers.
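
To illustrate the idea only, here is a toy model of that filter. It is not the engine's code: the Job class, the function, and the second deployment id are made up, and real acquisition also looks at locks, retries, due dates and so on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    id: str
    deployment_id: str


def acquirable(jobs, registered_deployments, deployment_aware=True):
    """Return the jobs this engine would consider in this simplified model."""
    if not deployment_aware:
        return list(jobs)
    # Deployment aware: only jobs of deployments registered with this engine.
    return [job for job in jobs if job.deployment_id in registered_deployments]


OLD_DEPLOYMENT = "4e54483a-52a0-11e6-bd44-0242ac120003"  # DEPLOYMENT_ID_ from post #1
NEW_DEPLOYMENT = "redeployed-model"  # hypothetical id of a later re-deployment

jobs = [
    Job("df7ac245-5824-11e6-8ecb-0242ac120003", OLD_DEPLOYMENT),
    Job("job-of-new-deployment", NEW_DEPLOYMENT),  # hypothetical job
]

# A freshly started engine that has only seen the re-deployment.
registered = {NEW_DEPLOYMENT}

picked_aware = [job.id for job in acquirable(jobs, registered)]
picked_unaware = [job.id for job in acquirable(jobs, registered, deployment_aware=False)]
print(picked_aware)
print(picked_unaware)
```

In this model the job of the old deployment is only picked up once its deployment id is in the registered set, or once the filter is switched off.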

Cheers,
Thorben


#12

We deploy directly via the REST API. We are not using a process application.


#13

Ok, then either deactivate the deploymentAware setting or register the deployments manually.

For process applications, the process engine has logic that detects previous versions of a deployment and makes the respective registrations. For standalone deployments, that is not the case.


#14

Ok, I am going to try deactivating the deploymentAware setting. Thank you very much for your help.


#15

@thorben
Have you experienced this using the docker container as a dev environment?
We have experienced this above behaviour, essentially:


But all inside of the default settings docker container that is provided at https://github.com/camunda/docker-camunda-bpm-platform.

Looks like the job executor has problems when you start and stop the container.

Anyone else experience this?


#16

Confirmed. If you run the docker container with elements that are async and then restart the container, the process engine logic that detects previous versions of a deployment does not seem to function.

These are processes deployed through the API to the process engine.

@camunda


#17

Anyone from Camunda able to advise?

@camunda

Thanks guys!


#18

Hi @StephenOTT

the docker containers use the default Camunda distribution settings, so the job executor is configured to be deployment aware.

If you deploy processes through the REST API you cannot use this setting as already mentioned by @thorben:

This is not related to the docker images.

Or if I misunderstood your setup or deployment procedure please tell me how I can reproduce this behavior.

Cheers,
Sebastian


#19

@menski I do not think this is related “specifically” to the docker image. But I guess it’s possible.

Based on @thorben’s comments above, and the documentation about DeploymentAware (which is on by default).

The steps to reproduce:

  1. Deploy default docker container setup.
  2. Deploy a BPMN with a few automated tasks such as a script task. Make the Start event, and all tasks async.
  3. Run the BPMN.
  4. Stop the Container
  5. Start the Container
  6. Run the BPMN

On the second run of the BPMN, the behaviour we are seeing is that the process gets stuck on the start event and does not move forward: the job executor is no longer running jobs for that process deployment. If you redeploy the BPMN, the problem is resolved. It looks like the DeploymentAware registration, i.e. the details that @thorben describes here:

are not occurring.


#20

How do you deploy the process? As process application (war), by Java Code or with the REST API?