FK Violation while flushing to history (Probably GC-related)


#1

I’m new to Camunda and trying to test drive it. I am using:
Camunda BPM: (v7.9.0-alpha3)
Camunda Spring Boot Starter-Rest: (v3.0.0-alpha1)
H2-DB
The test process is very simple: start - 2 external tasks - end.
order-test

After deploying it, I use a REST client to start/fectchAndLock/complete a few thousand process instances and then get many exceptions. It seems the batch deletion order is wrong.

2018-05-12 20:14:44.324 ERROR 10478 — [nio-8080-exec-7] org.camunda.bpm.engine.context : ENGINE-16004 Exception while closing command context: ENGINE-03083 Exception while executing Batch Database Operations with message ’

Error flushing statements. Cause: org.apache.ibatis.executor.BatchExecutorException: org.camunda.bpm.engine.impl.persistence.entity.ExecutionEntity.deleteExecution (batch index #4) failed. 3 prior sub executor(s) completed successfully, but will be rolled back. Cause: org.h2.jdbc.JdbcBatchUpdateException: Referential integrity constraint violation: “ACT_FK_EXE_PROCINST: PUBLIC.ACT_RU_EXECUTION FOREIGN KEY(PROC_INST_ID_) REFERENCES PUBLIC.ACT_RU_EXECUTION(ID_) (‘0c06e861-55de-11e8-9e9f-0242ef8473e1’)”; SQL statement:

delete from ACT_RU_EXECUTION where ID_ = ? and REV_ = ? [23503-196]

Cause: org.apache.ibatis.executor.BatchExecutorException: org.camunda.bpm.engine.impl.persistence.entity.ExecutionEntity.deleteExecution (batch index #4) failed. 3 prior sub executor(s) completed successfully, but will be rolled back. Cause: org.h2.jdbc.JdbcBatchUpdateException: Referential integrity constraint violation: “ACT_FK_EXE_PROCINST: PUBLIC.ACT_RU_EXECUTION FOREIGN KEY(PROC_INST_ID_) REFERENCES PUBLIC.ACT_RU_EXECUTION(ID_) (‘0c06e861-55de-11e8-9e9f-0242ef8473e1’)”; SQL statement:

delete from ACT_RU_EXECUTION where ID_ = ? and REV_ = ? [23503-196]
org.h2.jdbc.JdbcBatchUpdateException: Referential integrity constraint violation: “ACT_FK_EXE_PROCINST: PUBLIC.ACT_RU_EXECUTION FOREIGN KEY(PROC_INST_ID_) REFERENCES PUBLIC.ACT_RU_EXECUTION(ID_) (‘0c06e861-55de-11e8-9e9f-0242ef8473e1’)”; SQL statement:
delete from ACT_RU_EXECUTION where ID_ = ? and REV_ = ? [23503-196]
org.h2.jdbc.JdbcSQLException: Referential integrity constraint violation: “ACT_FK_EXE_PROCINST: PUBLIC.ACT_RU_EXECUTION FOREIGN KEY(PROC_INST_ID_) REFERENCES PUBLIC.ACT_RU_EXECUTION(ID_) (‘0c06e861-55de-11e8-9e9f-0242ef8473e1’)”; SQL statement:
delete from ACT_RU_EXECUTION where ID_ = ? and REV_ = ? [23503-196]
org.h2.jdbc.JdbcSQLException: Referential integrity constraint violation: “ACT_FK_EXT_TASK_EXE: PUBLIC.ACT_RU_EXT_TASK FOREIGN KEY(EXECUTION_ID_) REFERENCES PUBLIC.ACT_RU_EXECUTION(ID_) (‘0cc333d3-55de-11e8-9e9f-0242ef8473e1’)”; SQL statement:
delete from ACT_RU_EXECUTION where ID_ = ? and REV_ = ? [23503-196]
'. Flush summary:
[
INSERT HistoricExternalTaskLogEntity[0dc67336-55de-11e8-9e9f-0242ef8473e1]
INSERT HistoricActivityInstanceEventEntity[EndEvent_14h6lw0:0dc67337-55de-11e8-9e9f-0242ef8473e1]
DELETE ExternalTaskEntity[0cc333d5-55de-11e8-9e9f-0242ef8473e1]
DELETE ExecutionEntity[0cc333d3-55de-11e8-9e9f-0242ef8473e1]
DELETE ExecutionEntity[0c06e861-55de-11e8-9e9f-0242ef8473e1]
UPDATE HistoricActivityInstanceEventEntity[Task_0pvkpo3:0cc333d4-55de-11e8-9e9f-0242ef8473e1]
UPDATE HistoricProcessInstanceEventEntity[0c06e861-55de-11e8-9e9f-0242ef8473e1]
]

org.camunda.bpm.engine.ProcessEngineException: ENGINE-03083 Exception while executing Batch Database Operations with message ’

Error flushing statements. Cause: org.apache.ibatis.executor.BatchExecutorException: org.camunda.bpm.engine.impl.persistence.entity.ExecutionEntity.deleteExecution (batch index #4) failed. 3 prior sub executor(s) completed successfully, but will be rolled back. Cause: org.h2.jdbc.JdbcBatchUpdateException: Referential integrity constraint violation: “ACT_FK_EXE_PROCINST: PUBLIC.ACT_RU_EXECUTION FOREIGN KEY(PROC_INST_ID_) REFERENCES PUBLIC.ACT_RU_EXECUTION(ID_) (‘0c06e861-55de-11e8-9e9f-0242ef8473e1’)”; SQL statement:

delete from ACT_RU_EXECUTION where ID_ = ? and REV_ = ? [23503-196]

Cause: org.apache.ibatis.executor.BatchExecutorException: org.camunda.bpm.engine.impl.persistence.entity.ExecutionEntity.deleteExecution (batch index #4) failed. 3 prior sub executor(s) completed successfully, but will be rolled back. Cause: org.h2.jdbc.JdbcBatchUpdateException: Referential integrity constraint violation: “ACT_FK_EXE_PROCINST: PUBLIC.ACT_RU_EXECUTION FOREIGN KEY(PROC_INST_ID_) REFERENCES PUBLIC.ACT_RU_EXECUTION(ID_) (‘0c06e861-55de-11e8-9e9f-0242ef8473e1’)”; SQL statement:

delete from ACT_RU_EXECUTION where ID_ = ? and REV_ = ? [23503-196]
org.h2.jdbc.JdbcBatchUpdateException: Referential integrity constraint violation: “ACT_FK_EXE_PROCINST: PUBLIC.ACT_RU_EXECUTION FOREIGN KEY(PROC_INST_ID_) REFERENCES PUBLIC.ACT_RU_EXECUTION(ID_) (‘0c06e861-55de-11e8-9e9f-0242ef8473e1’)”; SQL statement:
delete from ACT_RU_EXECUTION where ID_ = ? and REV_ = ? [23503-196]
org.h2.jdbc.JdbcSQLException: Referential integrity constraint violation: “ACT_FK_EXE_PROCINST: PUBLIC.ACT_RU_EXECUTION FOREIGN KEY(PROC_INST_ID_) REFERENCES PUBLIC.ACT_RU_EXECUTION(ID_) (‘0c06e861-55de-11e8-9e9f-0242ef8473e1’)”; SQL statement:
delete from ACT_RU_EXECUTION where ID_ = ? and REV_ = ? [23503-196]
org.h2.jdbc.JdbcSQLException: Referential integrity constraint violation: “ACT_FK_EXT_TASK_EXE: PUBLIC.ACT_RU_EXT_TASK FOREIGN KEY(EXECUTION_ID_) REFERENCES PUBLIC.ACT_RU_EXECUTION(ID_) (‘0cc333d3-55de-11e8-9e9f-0242ef8473e1’)”; SQL statement:
delete from ACT_RU_EXECUTION where ID_ = ? and REV_ = ? [23503-196]
'. Flush summary:
[
INSERT HistoricExternalTaskLogEntity[0dc67336-55de-11e8-9e9f-0242ef8473e1]
INSERT HistoricActivityInstanceEventEntity[EndEvent_14h6lw0:0dc67337-55de-11e8-9e9f-0242ef8473e1]
DELETE ExternalTaskEntity[0cc333d5-55de-11e8-9e9f-0242ef8473e1]
DELETE ExecutionEntity[0cc333d3-55de-11e8-9e9f-0242ef8473e1]
DELETE ExecutionEntity[0c06e861-55de-11e8-9e9f-0242ef8473e1]
UPDATE HistoricActivityInstanceEventEntity[Task_0pvkpo3:0cc333d4-55de-11e8-9e9f-0242ef8473e1]
UPDATE HistoricProcessInstanceEventEntity[0c06e861-55de-11e8-9e9f-0242ef8473e1]
]
at org.camunda.bpm.engine.impl.db.EnginePersistenceLogger.flushDbOperationsException(EnginePersistenceLogger.java:697) ~[camunda-engine-7.9.0-alpha3.jar!/:7.9.0-alpha3]
at org.camunda.bpm.engine.impl.db.entitymanager.DbEntityManager.flushDbOperations(DbEntityManager.java:355) ~[camunda-engine-7.9.0-alpha3.jar!/:7.9.0-alpha3]
at org.camunda.bpm.engine.impl.db.entitymanager.DbEntityManager.flushDbOperationManager(DbEntityManager.java:318) ~[camunda-engine-7.9.0-alpha3.jar!/:7.9.0-alpha3]
at org.camunda.bpm.engine.impl.db.entitymanager.DbEntityManager.flush(DbEntityManager.java:290) ~[camunda-engine-7.9.0-alpha3.jar!/:7.9.0-alpha3]
at org.camunda.bpm.engine.impl.interceptor.CommandContext.flushSessions(CommandContext.java:203) ~[camunda-engine-7.9.0-alpha3.jar!/:7.9.0-alpha3]
at org.camunda.bpm.engine.impl.interceptor.CommandContext.close(CommandContext.java:132) ~[camunda-engine-7.9.0-alpha3.jar!/:7.9.0-alpha3]
at org.camunda.bpm.engine.impl.interceptor.CommandContextInterceptor.execute(CommandContextInterceptor.java:113) [camunda-engine-7.9.0-alpha3.jar!/:7.9.0-alpha3]
at org.camunda.bpm.engine.spring.SpringTransactionInterceptor$1.doInTransaction(SpringTransactionInterceptor.java:42) [camunda-engine-spring-7.9.0-alpha3.jar!/:7.9.0-alpha3]
at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140) [spring-tx-5.0.4.RELEASE.jar!/:5.0.4.RELEASE]
at org.camunda.bpm.engine.spring.SpringTransactionInterceptor.execute(SpringTransactionInterceptor.java:40) [camunda-engine-spring-7.9.0-alpha3.jar!/:7.9.0-alpha3]
at org.camunda.bpm.engine.impl.interceptor.ProcessApplicationContextInterceptor.execute(ProcessApplicationContextInterceptor.java:66) [camunda-engine-7.9.0-alpha3.jar!/:7.9.0-alpha3]
at org.camunda.bpm.engine.impl.interceptor.LogInterceptor.execute(LogInterceptor.java:30) [camunda-engine-7.9.0-alpha3.jar!/:7.9.0-alpha3]
at org.camunda.bpm.engine.impl.ExternalTaskServiceImpl.complete(ExternalTaskServiceImpl.java:52) [camunda-engine-7.9.0-alpha3.jar!/:7.9.0-alpha3]
at org.camunda.bpm.engine.rest.sub.externaltask.impl.ExternalTaskResourceImpl.complete(ExternalTaskResourceImpl.java:108) [camunda-engine-rest-core-7.9.0-alpha3.jar!/:7.9.0-alpha3]
at sun.reflect.GeneratedMethodAccessor218.invoke(Unknown Source) ~[na:na]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_162]
at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_162]
at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:76) [jersey-server-2.26.jar!/:na]
… …
there are many more stack traces, but I think they are irrelevant to this problem.

Kelvin


#2

I have reconfigured the data source to postgresql and tried setting different java max heap limit. The error seems to occur during java GC. Limiting heap size with -Xmx1024m will reproduce the error when memory usage reaches 1.xG. Setting max heap to 4096m, the error will NOT occur UNTIL memory goes to 4.xG.

Hope this helps to solve the problem. My wild guess is : finished in-memory Execution/Activity objects are viewed as garbage collectible, but somehow GC-triggered batch DB operations are not compatible with DB foreign key constraints.

Kelvin


#3

Hi Kelvin,

Thanks for looking into this. Could you please share your project for reproducing this on github?

Cheers,
Thorben


#4

Hi Thorben,

Please see https://github.com/icecoolbug/bpmn-test.

It’s not really a project yet. On the server side, there is only one meaningful line of code: calculating Long.MAX-current time as task priority, and the spring boot application does nothing. To isolate the problem, I have used a postgres DB instead of H2 in-mem DB. You may change src/resources/application.yml back to default H2 though. The process definition is in src/main/resources/order-test.bpmn.

The REST client is bpmn_client/main.go. It does 3 things through Camunda REST API: start n process instances, fetch and lock external tasks (two topics defined in the process), and complete the external tasks. It has one dependency: github.com/go-resty/resty. To run it, just use command line:
go run main.go

On my laptop, running the server with “java -Xmx1024m -jar maven_packaged_jar” and running the client twice will reproduce the problem and end up with java.lang.OutOfMemoryError: GC overhead limit exceeded.

Best Regards,
Kelvin


#5

Hi Thorben,

I looked at org.camunda.bpm.engine.impl.persistence.entity.ExternalTaskEntity.java on github. I suspect the problem lies in this method (line 496-500):

@Override
public Set getReferencedEntityIds() {
Set referencedEntityIds = new HashSet();
return referencedEntityIds;
}

I think ExternalTaskEntity should put executionId into the result set, then correct flushing order of DB operations can be calculated. Current implementation returns an empty set. This is the first time I dig into the source code, so please forgive me if I am wrong.

Kelvin


#6

Hi Kelvin,

I didn’t have the time to run your project yet.

I like that you started digging into the implementation yourself :slightly_smiling_face: . Regarding your findings, the engine sorts database operations by this strategy:

  1. All INSERT statements
    1. Order statements by entity type (the order is defined in EntityTypeComparatorForModifications such that first insert referenced entities and then referencing entities and so on are inserted to avoid FK violations)
    2. Per entity type, order the statements by when the engine submitted the operation
    3. Reorder the statements by object references (as defined in HasDbReferences#getReferencedEntityIds)
  2. All UPDATE/DELETE statements
    1. Order statements by entity type (reverse order that was is used for INSERT so that referencing entities are updated first to avoid FK violations)
    2. Per entity type
      1. All single row UPDATE statements, ordered by submission time and object references (same as INSERT)
      2. All single row DELETE statements, ordered by submission time and object references (same as INSERT)
      3. All bulk UPDATE/DELETE statements

This logic is defined in DbOperationManager#calculateFlush if you want to take a look.

That means, that HasDbReferences#getReferencedEntityIds has only effect for entities of the same type. Relations of entities of different types are supposed to be covered by the entity type ordering.

All that said, there may still be cases that this ordering scheme does not catch and maybe we have one here. In order to better understand what is going, you can set the following loggers to DEBUG:

  • org.camunda.bpm.engine.persistence
  • org.camunda.bpm.engine.cmd

This will log all the commands the engine makes and a summary of the ordered statements as they are performed. Maybe that way you can identify an incorrect ordering.

Cheers,
Thorben


#7

Hi Thorben,

You are right. The statement order is actually correct. The error is caused by GC failure, and sometimes it just reports OOM. Here is what looks like in GC log right before the error occurs:

286.666: [Full GC (Ergonomics) 929315K->929094K(990208K), 0.5275452 secs]
287.197: [Full GC (Ergonomics) 929315K->929092K(990208K), 0.4295879 secs]
287.628: [Full GC (Ergonomics) 929315K->929098K(990208K), 0.2686926 secs]
287.900: [Full GC (Ergonomics) 929315K->929100K(990208K), 0.5853791 secs]
288.493: [Full GC (Ergonomics) 929308K->929083K(990208K), 0.5514679 secs]
289.058: [Full GC (Ergonomics) 929307K->929103K(990208K), 0.5633454 secs]
289.627: [Full GC (Ergonomics) 929298K->929081K(990208K), 0.5472502 secs]
290.175: [Full GC (Ergonomics) 929297K->929086K(990208K), 0.4067904 secs]
290.583: [Full GC (Ergonomics) 929297K->929088K(990208K), 0.2694454 secs]
290.853: [Full GC (Ergonomics) 929297K->929088K(990208K), 0.5399290 secs]
291.394: [Full GC (Ergonomics) 929295K->929089K(990208K), 0.2639929 secs]
291.659: [Full GC (Ergonomics) 929295K->929090K(990208K), 0.5622436 secs]
292.222: [Full GC (Ergonomics) 929295K->929088K(990208K), 0.4634268 secs]
292.686: [Full GC (Ergonomics) 929295K->929088K(990208K), 0.2771819 secs]
292.964: [Full GC (Ergonomics) 929295K->929091K(990208K), 0.5245882 secs]
293.489: [Full GC (Ergonomics) 929295K->929096K(990208K), 0.2645995 secs]
293.757: [Full GC (Ergonomics) 929295K->929082K(990208K), 0.5751693 secs]
294.347: [Full GC (Ergonomics) 929287K->929050K(990208K), 0.7419616 secs]
295.126: [Full GC (Ergonomics) 929765K->929202K(990208K), 0.5677729 secs]
295.696: [Full GC (Ergonomics) 929761K->929204K(990208K), 0.2696693 secs]
295.971: [Full GC (Ergonomics) 929711K->929301K(990208K), 0.2651881 secs]
296.253: [Full GC (Ergonomics) 929761K->928721K(990208K), 0.6077022 secs]
296.868: [Full GC (Ergonomics) 929375K->928713K(990208K), 0.2691628 secs]
297.169: [Full GC (Ergonomics) 929375K->928828K(990208K), 0.4776897 secs]
297.657: [Full GC (Ergonomics) 929374K->929038K(990208K), 0.2783507 secs]
297.944: [Full GC (Ergonomics) 929374K->929010K(990208K), 0.2734516 secs]
298.222: [Full GC (Ergonomics) 929374K->929172K(990208K), 0.2917520 secs]
298.521: [Full GC (Ergonomics) 929374K->929308K(990208K), 0.2720952 secs]
298.794: [Full GC (Ergonomics) 929374K->929347K(990208K), 0.2709119 secs]
299.066: [Full GC (Ergonomics) 929374K->929343K(990208K), 0.2870686 secs]
299.355: [Full GC (Ergonomics) 929374K->929297K(990208K), 0.2812738 secs]
299.650: [Full GC (Ergonomics) 929374K->928731K(990208K), 0.4332400 secs]
300.225: [Full GC (Ergonomics) 929374K->928646K(990208K), 0.6938743 secs]
300.920: [Full GC (Ergonomics) 929779K->928628K(990208K), 0.5445955 secs]
301.542: [Full GC (Ergonomics) 929779K->925390K(990208K), 0.7166940 secs]
304.402: [Full GC (Ergonomics) 929621K->921285K(990208K), 0.7225028 secs]

As you can see, GC takes almost all running time but can not reclaim enough memory even if previous processes have finished execution. There must be memory leaks somewhere or connections holding resources for too long (probably some keep-alive settings?). Without any experiences of how bpm process engine works, I don’t know where to look at.

I also downloaded latest camunda-bpm-platform master branch, but “mvn clean test -PtestBpmn” failed with following error:

Results :

Failed tests:
HistoryCleanupRestServiceInteractionTest.testHistoryConfigurationOutsideBatchWindow:177 1 expectation failed.
JSON path batchWindowEndTime doesn’t match.
Expected: a string containing “2017-09-03T06:00:00.000+0800”
Actual: 2017-09-02T06:00:00.000+0800

MetricsRestServiceInteractionTest.testGetInterval:89 3 expectations failed.
JSON path [0].timestamp doesn’t match.
Expected: 1970-01-01T01:45:00.000+0800
Actual: 1970-01-01T08:45:00.000+0800

JSON path [1].timestamp doesn’t match.
Expected: 1970-01-01T01:30:00.000+0800
Actual: 1970-01-01T08:30:00.000+0800

JSON path [2].timestamp doesn’t match.
Expected: 1970-01-01T01:15:00.000+0800
Actual: 1970-01-01T08:15:00.000+0800

It seems to be a local time zone problem.

Regards,
Kelvin