Greetings!
Problem: the process engine doesn't switch process activities, and the logs show an optimistic locking exception plus some "activity not found" cases.
Env 1: Camunda 7.11, Spring Boot 2.3.1.RELEASE, pgpool, PostgreSQL 9.6.9 (two nodes: master, slave; async replication)
Env 2: Camunda 7.15, Spring Boot 2.4.3, pgpool, PostgreSQL 9.6.9 (two nodes: master, slave; async replication)
Env 3: Camunda 7.15, Spring Boot 2.4.3, pgpool, PostgreSQL 9.6.9 (one node)
One more detail: pgpool load-balances queries. Write queries are routed to the master, while all other queries are sent to the slave.
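For context, the balancing is set up roughly like this in pgpool.conf (an illustrative fragment; the host names and weights are placeholders, not my exact values):

```ini
# pgpool.conf (illustrative fragment, pgpool-II 4.0-era parameter names)
load_balance_mode = on            # SELECTs may be sent to any backend

backend_hostname0 = 'pg-master'   # placeholder host names
backend_port0 = 5432
backend_weight0 = 1

backend_hostname1 = 'pg-slave'
backend_port1 = 5432
backend_weight1 = 1

master_slave_mode = on
master_slave_sub_mode = 'stream'  # PostgreSQL streaming replication
```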
All processes are built on external tasks. My normal flow for executing a task:
- fetchAndLock the tasks and send them all to the required microservices (without blocking);
- when a microservice replies, Camunda completes the task (completeTask).
completeTask throws an exception saying the external task is not locked (locked by 'null'), and in some unexplained cases the specified activity ID is not found.
I tried adding a delay between fetch and complete (10 seconds before completing the task), and all the completeTask problems (the lock-related ones, at least) disappeared. Further investigation showed that the PostgreSQL slave has a large replication lag (from 1 to 6 seconds), so the slave is not updated from the master in the window between fetch and complete: the process engine writes the lock to the master, but at the completeTask stage it reads the external task from the slave, where the lock has not been written yet.
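To illustrate the race I suspect (a toy simulation of async replication, not actual Camunda code; the names and the lag value are made up):

```python
import threading
import time

# Toy model of async replication: a master store and a slave store,
# where the slave applies changes only after a fixed lag.
master = {}
slave = {}
REPLICATION_LAG = 0.2  # seconds (the real lag I observed is 1-6 s)

def replicate(key, value):
    """Apply the master's write to the slave after the replication lag."""
    def apply():
        time.sleep(REPLICATION_LAG)
        slave[key] = value
    threading.Thread(target=apply).start()

def fetch_and_lock(task_id, worker):
    master[task_id] = worker    # the lock is written on the master...
    replicate(task_id, worker)  # ...and reaches the slave only later

def complete(task_id, worker):
    # pgpool routes this read to the slave
    locked_by = slave.get(task_id)
    if locked_by != worker:
        raise RuntimeError(f"External task {task_id} is locked by {locked_by}")

fetch_and_lock("task-1", "worker-1")
try:
    complete("task-1", "worker-1")  # immediate complete hits a stale slave
except RuntimeError as e:
    print(e)                        # the lock is not visible on the slave yet

time.sleep(REPLICATION_LAG * 2)     # the "10 s delay" workaround, scaled down
complete("task-1", "worker-1")      # now the slave has caught up
print("complete OK after waiting out the replication lag")
```

This matches what I see: completing immediately fails with "locked by null", while completing after a delay longer than the replication lag succeeds.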
Camunda documentation says: “Clustered or replicated databases are supported given the following conditions. The communication between Camunda Platform and the database cluster has to match with the corresponding non-clustered / non-replicated configuration. It is especially important that the configuration of the database cluster guarantees the equivalent behavior of READ-COMMITTED isolation level.”
I forced the database isolation level to READ COMMITTED (although it is READ COMMITTED by default). As expected, this did not fix the problem.
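For reference, this is how the isolation level can be checked and forced on the PostgreSQL side (standard PostgreSQL commands; the database name `camunda` is just an example):

```sql
-- check the effective default isolation level
SHOW default_transaction_isolation;

-- force it explicitly for one database
ALTER DATABASE camunda SET default_transaction_isolation = 'read committed';
```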
I also have massive problems with process migration and any other interaction with Camunda where the engine mixes read and write queries (I think): history writing, fast activity changes, and so on.
On Env 3 the problem is not reproduced, except for the strange case with the activity ID.
Does anyone have ideas or experience with an HA database environment?
Camunda configuration:
```yaml
spring:
  jpa:
    database: POSTGRESQL
    hibernate.ddl-auto: create-drop
    show-sql: true
  datasource:
    platform: postgres
    url: ${DB_DEFAULT_URL}
    username: ${DB_DEFAULT_USERNAME}
    password: ${DB_DEFAULT_PASSWORD}
    hikari:
      maximum-pool-size: ${DB_MAX_CONNECTIONS:100}
      minimum-idle: ${DB_MIN_CONNECTIONS:10}
      transaction-isolation: TRANSACTION_READ_COMMITTED
  jersey:
    application-path: ${CAMUNDA_REST_PATH:engine-rest}

camunda.bpm:
  generic-properties:
    properties:
      history-cleanup-metrics-enabled: true
      batch-operation-history-time-to-live: P2D
      history-time-to-live: P7D
      history-cleanup-strategy: removalTimeBased
      history-removal-time-strategy: end
      history-cleanup-batch-window-start-time: "20:00"
      history-cleanup-batch-window-end-time: "04:00"
      history-cleanup-degree-of-parallelism: 4
  # Main Engine Configurations
  enabled: true
  process-engine-name: default
  generate-unique-process-engine-name: false
  generate-unique-process-application-name: false
  default-serialization-format: default
  history-level: FULL
  history-level-default: FULL
  auto-deployment-enabled: true
  default-number-of-retries: 3
  job-executor-acquire-by-priority: false
  job-executor-acquire-by-due-date: true
  id-generator: strong
  deployment-resource-pattern: classpath*:bpmn/*.bpmn
  # Jobs
  job-execution:
    enabled: true
    deployment-aware: true
    core-pool-size: 8
    keep-alive-seconds: 0
    lock-time-in-millis: 10000
    max-jobs-per-acquisition: 16
    max-pool-size: 8
    queue-capacity: 128
    wait-time-in-millis: 1000
    max-wait: 60000
    backoff-time-in-millis: 0
    max-backoff: 0
    backoff-decrease-threshold: 100
    wait-increase-factor: 2
  # Datasource
  database:
    schema-update: true
    type: postgres
    jdbc-batch-processing: true
  # Eventing
  eventing:
    execution: true
    history: true
    task: true
  # JPA
  jpa:
    enabled: true
    #persistence-unit-name: JPA persistence unit name
    close-entity-manager: true
    handle-transaction: true
```