Camunda process stops midway while processing instances

We have a set of services and scripts set up in the Camunda engine. The process is initiated by an API call, and each object (process instance) passes through a series of internal checks and external API calls, and is eventually written to a MongoDB.

The deployment is a Spring Boot JAR with Camunda embedded as a dependency. Instances are processed at a rate of about 2,500 per minute.

[Camunda workflow diagram]

Eventually the engine stops with no errors; the logs look like this:

The job executor fails to fetch any new jobs and backs off exponentially.

06:53:21.623 [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] DEBUG o.c.b.e.i.p.e.J.selectNextJobsToExecute - ==> Preparing: select RES.ID_, RES.REV_, RES.DUEDATE_, RES.PROCESS_INSTANCE_ID_, RES.EXCLUSIVE_ from ACT_RU_JOB RES where (RES.RETRIES_ > 0) and (RES.DUEDATE_ is null or RES.DUEDATE_ <= ?) and (RES.LOCK_OWNER_ is null or RES.LOCK_EXP_TIME_ < ?) and RES.SUSPENSION_STATE_ = 1 and ((RES.EXCLUSIVE_ = 1 and not exists(select J2.ID_ from ACT_RU_JOB J2 where J2.PROCESS_INSTANCE_ID_ = RES.PROCESS_INSTANCE_ID_ -- from the same proc. inst. and (J2.EXCLUSIVE_ = 1) -- also exclusive and (J2.LOCK_OWNER_ is not null and J2.LOCK_EXP_TIME_ >= ?) -- in progress)) or RES.EXCLUSIVE_ = 0) LIMIT ? OFFSET ?
06:53:21.623 [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] DEBUG o.c.b.e.i.p.e.J.selectNextJobsToExecute - ==> Parameters: 2021-07-16 06:53:21.623(Timestamp), 2021-07-16 06:53:21.623(Timestamp), 2021-07-16 06:53:21.623(Timestamp), 3(Integer), 0(Integer)
06:53:21.625 [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] DEBUG o.c.b.e.i.p.e.J.selectNextJobsToExecute - <== Total: 0

The issue is reproducible any number of times, and processing always stops at 15K instances. We have tried:

  1. changing the initial page size
  2. adding a wait timer between each process
  3. adding a wait timer between each fetch of data passing through the engine

CloudWatch shows CPU around 20%, memory around 30%, and RDS connections within limits.

An interesting thing I noticed: the failure begins when it tries to fetch from the ACT_RU_JOB table. The query just before this was:

select * from ACT_RE_PROCDEF p1 inner join (select KEY_, TENANT_ID_, max(VERSION_) as MAX_VERSION from ACT_RE_PROCDEF RES where KEY_ = ? group by TENANT_ID_, KEY_) p2 on p1.KEY_ = p2.KEY_ where p1.VERSION_ = p2.MAX_VERSION and (p1.TENANT_ID_ = p2.TENANT_ID_ or (p1.TENANT_ID_ is null and p2.TENANT_ID_ is null))

Can you give some more details?
Version of Camunda?
Database being used by Camunda?
Upload your process model.

Yes, we are using Camunda version 7.15.
The backend database is Amazon Aurora MySQL 5.7. process.bpmn (11.0 KB)

A point to add… it works perfectly fine with our dev database, which is a t3.medium (it processed all 150K leads), but fails on an r4.xlarge and on a serverless database. The Camunda VM is the same host in both cases.

What is returned from the call?
Are any errors being thrown back?

Can you explain what you mean by this?

The Swagger API basically starts a loop that fetches 1,000 objects from MongoDB and processes them individually.

for (final Store feature : totalobjects) {
    try {
        // This is where we tried to add Thread.sleep(1000)

        final Map<String, Object> variables = new HashMap<>();
        variables.put(KeyConstants.FEATURE, feature);
        final ProcessInstance process =
                runtimeService.startProcessInstanceByKey(KeyConstants.KEY_PROCESS, variables);

        // Looked up for inspection; the executor is not used below
        ProcessEngine defaultEngine = ProcessEngines.getDefaultProcessEngine();
        JobExecutor jobExecutor =
                ((ProcessEngineConfigurationImpl) defaultEngine.getProcessEngineConfiguration())
                        .getJobExecutor();
    } catch (Exception exception) {
        exception.printStackTrace();
    }
}

Adding to what Ankit said: our project is built on Spring Boot with Camunda. We have deployed the JAR on an EC2 instance, and we have an API to trigger the overall process.

The internal logic mostly fetches data from a NoSQL DB in batches of 1,000, with more than 10 lakh (1 million) records in total. For each object in a 1,000-record batch, we create a process instance as Ankit described. Somehow, after the 15th iteration, process instances stop being created, and we don't see any error; only the exact query Ankit shared appears in the logs. Hope this helps.
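The batch loop can be sketched in plain Java (`fetchBatch` is a hypothetical stand-in for the MongoDB query). Note that 15 iterations of 1,000 objects is exactly the 15K instances at which the engine stalls, which suggests the stall point is iteration-dependent rather than data-dependent:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class BatchLoopSketch {
    static final int BATCH_SIZE = 1000;

    // Hypothetical stand-in for the MongoDB batch query
    static List<Integer> fetchBatch(int iteration) {
        return IntStream.range(iteration * BATCH_SIZE, (iteration + 1) * BATCH_SIZE)
                .boxed()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        int started = 0;
        for (int iteration = 0; iteration < 15; iteration++) {   // stalls after the 15th batch
            for (Integer object : fetchBatch(iteration)) {
                // runtimeService.startProcessInstanceByKey(...) in the real code
                started++;
            }
        }
        System.out.println(started); // 15000 — matching the observed stall point
    }
}
```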

@Ankit_Koshti / @KushalDesai-TomTom The issue is that when you start the process instance synchronously, execution continues on the borrowed client thread, and the response is returned (the client thread is released) only once the execution token reaches a wait state/save point.

Since your process doesn't have any wait states or save points, the entire process execution runs on the client thread (a blocking call).
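The difference can be illustrated with a plain-JDK analogy (not Camunda API): a synchronous start runs the work on the caller's thread, while an asynchronous continuation hands the work to a separate thread pool and returns immediately, which is roughly what Camunda's job executor does once a save point is reached.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SyncVsAsync {
    // Stand-in for the heavy work done by the service tasks
    static void heavyProcessWork() {
        try {
            TimeUnit.MILLISECONDS.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // Synchronous: the caller is blocked for the whole execution,
        // like startProcessInstanceByKey on a process with no save point.
        long t0 = System.nanoTime();
        heavyProcessWork();
        long syncMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - t0);

        // Asynchronous: the caller only enqueues the work and returns;
        // a background pool (playing the "job executor") runs it later.
        ExecutorService jobExecutor = Executors.newFixedThreadPool(3);
        long t1 = System.nanoTime();
        jobExecutor.submit(SyncVsAsync::heavyProcessWork);
        long asyncMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - t1);

        System.out.println("sync call ~" + syncMillis + " ms, async submit ~" + asyncMillis + " ms");

        jobExecutor.shutdown();
        try {
            jobExecutor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```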

It's not a good idea to put Thread.sleep(x) in the execution code, and it can be avoided. Thread.sleep is bad! It blocks the current thread and renders it unusable for further work. Thread.sleep(n) means blocking the current thread for at least n milliseconds' worth of scheduler timeslices.

If the current thread is a non-daemon thread, Thread.sleep(n) also means your application cannot exit for more than n milliseconds, since the JVM only terminates after all non-daemon threads have completed. If the current thread is a daemon thread and the application exits, the thread will simply never be woken up (not good if you need your objects to be cleaned up or finalized).
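If the goal of the Thread.sleep(1000) tried above was throttling, the starts can instead be spaced out with a scheduled executor, which never blocks the submitting thread. A minimal sketch (the 1-second interval mirrors the sleep that was tried; `batches` and the task body are placeholders):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ThrottledStarter {
    public static void main(String[] args) {
        int batches = 3; // one task per batch in the real code
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch done = new CountDownLatch(batches);

        for (int i = 0; i < batches; i++) {
            // Schedule each batch 1 second apart instead of sleeping in a loop;
            // the submitting thread is never blocked.
            scheduler.schedule(() -> {
                // runtimeService.startProcessInstanceByKey(...) would go here
                done.countDown();
            }, i, TimeUnit.SECONDS);
        }

        try {
            done.await(); // only for the demo; a caller would normally return immediately
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        scheduler.shutdown();
    }
}
```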

To solve the issue:

Set a transaction boundary/save point on the StartEvent of the process, like this: camunda:asyncBefore="true". The process will then start asynchronously and the client thread will be released; Camunda's job executor will take care of executing the process instance.

<bpmn:startEvent id="StartEvent_1" camunda:asyncBefore="true">
  <bpmn:outgoing>Flow_1md3iqi</bpmn:outgoing>
</bpmn:startEvent>

Hi @aravindhrs ,

Currently we have all the delegates as service tasks, and each service task performs heavy operations. Do you think we should make them external service tasks? We tried asyncBefore="true" as well, but it didn't work out for us.

For the service tasks or the startEvent? It would be better to use it for both.

For long-running tasks it's recommended to use the External Task pattern.
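For example, a heavy service task can be turned into an external task by replacing its delegate with an external topic (a sketch; the id and topic name here are illustrative, not from the uploaded model):

```xml
<bpmn:serviceTask id="HeavyOperation"
                  camunda:type="external"
                  camunda:topic="heavy-operation" />
```

A separate worker process then polls that topic (e.g. via the camunda-external-task-client library), does the heavy work, and completes the task, so the engine's job executor threads are never tied up by it.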


Hi @aravindhrs

Thanks for the prompt reply. However, our process is FIFO: on the last row/object of an array of data we need to trigger a particular block of code. From what I understand, making all service tasks async will break that logic, as data will start flowing in an unordered fashion…

@KushalDesai-TomTom Is ordering required only for starting the process instances, or for the subsequent service tasks too? If ordered execution of service tasks across process instances is not required, you can set a save point and use the External Task pattern for the heavy operations, offloading that work from the process engine.