Duplicate timer start events after processing errors

Hi,

I run a process engine (version 7.4) with a job executor on PostgreSQL. A custom application uses the engine to generate additional processes from a BPMN XML template, so that users can schedule pre-configured reports to be generated and sent to them at a specific time each day.

Some time ago, we ran into a race condition in the process generation logic that resulted in an OutOfMemoryError because the DeploymentCache filled up. The race condition was a new deployment for a changed generated process combined with a process definition activation in the same transaction, which kept failing with optimistic locking errors in an endless loop.

As a symptom, existing scheduled processes using a timer start event tried to start but often failed and ran into an OOM themselves. This seems to have resulted in an additional timer being inserted into the database at timer fire time without the old one being deleted. The effect seems to have been made worse by retries and by lock expiration: in the history I can see several process instances started 5 minutes apart, which is the configured job lock expiration time.

After searching a bit, I found the following bug entry, which was supposed to have fixed that issue:
https://app.camunda.com/jira/browse/CAM-2797

However, in my case it didn’t, for some reason. I also took a deeper look at the timer start event handling code but found no obvious race condition there that would explain the symptoms.

As a last resort, I wanted to make sure at the database schema level that a situation like this simply cannot occur, by introducing an appropriate partial unique index on the “act_ru_job” table. But because the DbOperationManager re-orders statements in a hard-coded way, the index is violated: the timer for the new entry is inserted before the old one is deleted.
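
To illustrate why the statement order matters here, the following is a minimal in-memory sketch, not engine code. `UniqueTimerTable` and `reschedule` are hypothetical names, and the "unique index" is simply a check that no two rows share the same process key. It only models the ordering effect described above.

```java
import java.util.HashMap;
import java.util.Map;

final class FlushOrderSketch {

    /** Hypothetical stand-in for a job table with a unique index on the process key. */
    static final class UniqueTimerTable {
        private final Map<String, String> keyByJobId = new HashMap<>();

        void insert(String jobId, String processKey) {
            // Simulates the unique index: at most one timer row per process key.
            if (keyByJobId.containsValue(processKey)) {
                throw new IllegalStateException("unique index violated for " + processKey);
            }
            keyByJobId.put(jobId, processKey);
        }

        void delete(String jobId) {
            keyByJobId.remove(jobId);
        }

        int size() {
            return keyByJobId.size();
        }
    }

    /**
     * Replaces the old timer row with a new one for the same process key.
     * Returns true if both statements succeed, false if the index rejects the insert.
     */
    static boolean reschedule(UniqueTimerTable table, String oldJobId, String newJobId,
                              String processKey, boolean insertFirst) {
        try {
            if (insertFirst) {
                table.insert(newJobId, processKey); // old row still present: violation
                table.delete(oldJobId);
            } else {
                table.delete(oldJobId);             // old row gone before the insert
                table.insert(newJobId, processKey);
            }
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        UniqueTimerTable table = new UniqueTimerTable();
        table.insert("job-1", "dailyReport");
        System.out.println("insert first ok? "
                + reschedule(table, "job-1", "job-2", "dailyReport", true));
        System.out.println("delete first ok? "
                + reschedule(table, "job-1", "job-2", "dailyReport", false));
    }
}
```

With insert-before-delete, the new row collides with the old one even though the end state of the transaction would be consistent. A real database behaves the same way unless the constraint check is deferred to commit time.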

Of course, an OOM is a serious issue and there is not much we can do inside the JVM in such a case, which is why I want to ensure consistency at the database level.

Hi @ancoron,

it’s a bit hard to understand your issue. Can you please describe it in a more focused way and provide an example or a failing test case?

Best regards,
Philipp

Have you set or considered setting the timer start event to be asynchronous after? This would decouple retry from the timer start…

regards

Rob

Sorry for the very late reply on this issue.

@Philipp_Ossler: I tried to reproduce the issue but have not succeeded so far, even when creating an OOM situation. Also, I cannot produce an OOM in a test case, or the test itself won’t be able to detect the situation. However, it really looks like the OOM is the major issue here.

@Webcyberrob: thanks for the hint, but my process relies on some “synchronous” steps being executed (actually process variable validation) before it can fork away from the caller. So going async directly is not an option here.

Nevertheless, this situation has not arisen again so far, and I have implemented some compensation logic for detecting and fixing a situation like this.
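
The compensation logic itself is not shown in this thread. As a rough sketch of what the detection part could look like, the following assumes the timer start jobs have already been loaded into a simple list. `TimerJob` and `findSurplusJobIds` are hypothetical names, and keeping the job with the earliest due date per process definition key is just one possible policy.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DuplicateTimerSketch {

    /** Hypothetical, simplified view of one timer start job row. */
    static final class TimerJob {
        final String id;
        final String processDefinitionKey;
        final long dueDateMillis;

        TimerJob(String id, String processDefinitionKey, long dueDateMillis) {
            this.id = id;
            this.processDefinitionKey = processDefinitionKey;
            this.dueDateMillis = dueDateMillis;
        }
    }

    /**
     * Groups jobs by process definition key and returns the ids of every job
     * except the one with the earliest due date in each group.
     */
    static List<String> findSurplusJobIds(List<TimerJob> jobs) {
        Map<String, List<TimerJob>> byKey = new LinkedHashMap<>();
        for (TimerJob job : jobs) {
            byKey.computeIfAbsent(job.processDefinitionKey, k -> new ArrayList<>()).add(job);
        }

        List<String> surplus = new ArrayList<>();
        for (List<TimerJob> group : byKey.values()) {
            // Earliest due date first; ties are broken by id for a stable result.
            group.sort(Comparator.comparingLong((TimerJob j) -> j.dueDateMillis)
                    .thenComparing(j -> j.id));
            for (int i = 1; i < group.size(); i++) {
                surplus.add(group.get(i).id);
            }
        }
        return surplus;
    }
}
```

In a real setup, the input list would presumably be built from the engine's job query (for example via `ManagementService`), and the returned ids would then be removed in a separate step. Both parts are left out here because they depend on the application.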