How to Prevent the Same Job Being Executed by More Than One Executor Thread

Hello,

We deploy our process application into a homogeneous environment, and we have 'asynchronous continuation' enabled for all of our system tasks, per Camunda's best practices.

Recently we found that the same job can be executed by more than one executor thread. In the log below, thread-1 picked up the job first and completed it, but when it tried to update/remove the job in the job table it hit an OptimisticLockingException. We suspect that, because of this failure, thread-2 then picked up the same job and completed it a second time.

Is this the expected behavior? How can we prevent thread-2 from picking up a job that has already been completed by thread-1?

"thread" : “camundaTaskExecutor-1”,

“applicationName” : “baseWorkflowApp”
"message" : “Service task Publish QC Request to HS completed for business key 2995294”,

**“timestamp” : “2020-03-25T16:34:33.318Z”,

**“thread” : “camundaTaskExecutor-1”,

**“logger” : “org.camunda.bpm.engine.jobexecutor”,

**“message” : “ENGINE-14006 Exception while executing job 7d86a28d-6eb6-11ea-86dc-005056aeee74: OptimisticLockingException. To see the full stacktrace set logging level to DEBUG.”,

{
“timestamp” : “2020-03-25T16:34:33.325Z”,
"thread" : “camundaTaskExecutor-2”,

**“message” : “Service task Publish QC Request to HS started for business key 2995294”,

}

Thank you so much for the help!

Jason

Hi Jason.

The engine is designed to ensure that a job is executed by only one job executor thread at a time. The way this works is that the job executor claims the job and 'locks' it in the database so that other threads cannot take ownership.

However, this claim is a lease, which is usually set to 5 minutes. This is so that if the executing node or thread crashes, the job will eventually be picked up by another thread.

This can cause duplicate execution: if your job takes longer than the lease, another thread may claim the job, and when the first thread finally completes, an OptimisticLockingException is thrown because the engine detects that another thread has taken ownership of the job.

If your jobs take more than 5 minutes, see if you can change them to asynchronous integration patterns. If you are making external API calls and the API can block for longer than 5 minutes, set a more aggressive client-side timeout, etc.
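For example, something along these lines keeps an outbound call well under the job lease (a minimal sketch; the endpoint URL, payload handling and the 5/30-second budgets are just illustrative, not taken from your app):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class QcRequestClient {

    // Keep the total call budget far below the 5-minute job lease.
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .build();

    public static String publishQcRequest(String payload) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://hs.example.com/qc-requests")) // hypothetical endpoint
                .timeout(Duration.ofSeconds(30))                       // fail fast instead of blocking the job
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build();
        return CLIENT.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}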

Step 1 - Confirm this is the cause, i.e. your job execution time exceeds the lease lock time.
Step 2 - Identify ways to reduce this time, or change to more asynchronous integration patterns.
Step 3 - Set the job executor lock time to an appropriate value (do this as a last resort…).
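For step 3, a rough sketch of one way to do it via an engine plugin (the plugin wiring and the 10-minute value are assumptions on my part; the relevant knob is setLockTimeInMillis on the job executor):

import org.camunda.bpm.engine.impl.cfg.AbstractProcessEnginePlugin;
import org.camunda.bpm.engine.impl.cfg.ProcessEngineConfigurationImpl;

// Sketch: raise the job lease via a process engine plugin.
// Register the plugin with the engine configuration; the value is illustrative.
public class JobLockTimePlugin extends AbstractProcessEnginePlugin {

    @Override
    public void postInit(ProcessEngineConfigurationImpl configuration) {
        // Default lease is 5 minutes (300000 ms); only raise it after confirming
        // that job execution time really exceeds the lease.
        configuration.getJobExecutor().setLockTimeInMillis(10 * 60 * 1000);
    }
}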

regards

Rob


Hello Rob,
Thank you so much for your reply.
However, I'd like to clarify some points from our observations:

  1. In our case, the execution time is far less than the lease lock time. In the example I attached, it only takes 3 ms, so it is unlikely that the job execution time exceeded the lease lock time.
  2. We have already implemented asynchronous patterns for potentially long-running API calls.
  3. Another observation: once thread-1 hit the OptimisticLockingException, thread-2 jumped in and picked up the same job after only a 13 ms delay, without the lease lock expiring (I assume the default is 5 minutes).

Our questions are the following:

  1. Without the lease lock expiring, how could thread-2 pick up the same job?
  2. In the case of thread-1, it seems the failure occurred after the job execution had already completed. In that case, is it true that another thread will always attempt to re-execute the same job, resulting in duplicate job execution and, in our case, duplicate API calls?
    Note: the issue seems to be consistent and across the board.

Here is another instance to help illustrate the issue. For the same business key 2996185, within one second the task was started -> completed -> started again:

{"timestamp":"2020-03-26T11:56:54.480-04","level":"INFO","message":"Service task Acquire ESS token started for business key 2996185","context":"default"}
{"timestamp":"2020-03-26T11:56:54.483-04","message":"Service task Acquire ESS token completed for business key 2996185","context":"default"}
{"timestamp":"2020-03-26T11:56:54.507-04","level":"WARN","message":"ENGINE-14006 Exception while executing job 6a11408c-6f7a-11ea-89b1-005056aeee74: OptimisticLockingException. To see the full stacktrace set logging level to DEBUG.","context":"default"}
{"timestamp":"2020-03-26T11:56:54.520-04","level":"INFO","message":"Service task Acquire ESS token started for business key 2996185","context":"default"}

Thank you!

Jason

Hi Jason,

Are you running in a cluster? I've seen something similar when nodes in a cluster lose time synchronization.

Are you running a DB cluster, or a DB without the correct isolation level (READ COMMITTED)? If transactions are using separate snapshots, then you could have threads tripping over each other…
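One quick sanity check you could run (a minimal sketch, assuming you can get hold of the engine's DataSource):

import java.sql.Connection;
import javax.sql.DataSource;

public class IsolationLevelCheck {

    // Verify that the engine's datasource hands out READ COMMITTED connections,
    // which is what the engine's optimistic locking scheme expects.
    public static void assertReadCommitted(DataSource dataSource) throws Exception {
        try (Connection connection = dataSource.getConnection()) {
            if (connection.getTransactionIsolation() != Connection.TRANSACTION_READ_COMMITTED) {
                throw new IllegalStateException(
                    "Unexpected isolation level: " + connection.getTransactionIsolation());
            }
        }
    }
}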

regards

Rob