HA database cluster ruine process engine

mind · October 12, 2021, 11:20am

Greetings!

Problem: process engine doesn’t switch process activities, also logs have an optimistic lock exception and activity not found cases.
Env 1: Camunda 7.11, Spring Boot 2.3.1.RELEASE, pgpool, PostgreSQL 9.6.9 (two nodes: master, slave; async replication)
Env 2: Camunda 7.15, Spring Boot 2.4.3, pgpool, PostgreSQL 9.6.9 (two nodes: master, slave; async replication)
Env 3: Camunda 7.15, Spring Boot 2.4.3, pgpool, PostgreSQL 9.6.9 (one node)
One more detail - pgpool is balancing queries: write queries are route to master, all other queries translating to slave.

All processes builded on external tasks, my normal flow to execute tasks:

fetchAndLock tasks and send all to required microservices (without blocking).
Camunda receive microservice reply and doing completeTask
Complete task is throw exception what external task isn’t blocked (locked by null), in some unknown cases specified activity ID not found.

Im try to add some delay between fetch and complete (before complete task - 10 secs), all problems with complete task (only with task lock) is disappeared. My detail research is show: postgresql slave having a big replication lag (from 1 second to 6 second) and slave isn’t update from master in time between fetch and complete: process engine write a lock, in complete task stage engine gets external task from slave where’s lock isn’t writed.
Camunda documentation says: “Clustered or replicated databases are supported given the following conditions. The communication between Camunda Platform and the database cluster has to match with the corresponding non-clustered / non-replicated configuration. It is especially important that the configuration of the database cluster guarantees the equivalent behavior of READ-COMMITTED isolation level.”.
Im force database isolation level to READ COMMITTED (but it is by default has READ-COMMITTED). As expected its not fixing problem.

I’m have a massive problems with process migration and any other interaction with camunda (where camunda engine operates read and write queries I think): history writing, fast activity changes and other.

On ENV 3 problem is not reproduced, exclude strange case with activity ID.

Any ideas or experience with HA DB environment?

Camunda configuration:

spring:
  jpa:
    database: POSTGRESQL
    hibernate.ddl-auto: create-drop
    show-sql: true
  datasource:
    platform: postgres
    url: ${DB_DEFAULT_URL}
    username: ${DB_DEFAULT_USERNAME}
    password: ${DB_DEFAULT_PASSWORD}
    hikari:
      maximum-pool-size: ${DB_MAX_CONNECTIONS:100}
      minimum-idle: ${DB_MIN_CONNECTIONS:10}
      transaction-isolation: TRANSACTION_READ_COMMITTED
  jersey:
    application-path: ${CAMUNDA_REST_PATH:engine-rest}
camunda.bpm: 
  generic-properties:
    properties:
      history-cleanup-metrics-enabled: true
      batch-operation-history-time-to-live: P2D
      history-time-to-live: P7D
      history-cleanup-strategy: removalTimeBased
      history-removal-time-strategy: end
      history-cleanup-batch-window-start-time: "20:00"
      history-cleanup-batch-window-end-time: "04:00"
      history-cleanup-degree-of-parallelism: 4
  # Main Engine Configurations
  enabled: true
  process-engine-name: default
  generate-unique-process-engine-name: false
  generate-unique-process-application-name: false
  default-serialization-format: default
  history-level: FULL
  history-level-default: FULL
  auto-deployment-enabled: true
  default-number-of-retries: 3
  job-executor-acquire-by-priority: false
  job-executor-acquire-by-due-date: true
  id-generator: strong
  deployment-resource-pattern: classpath*:bpmn/*.bpmn
  # Jobs
  job-execution:
    enabled: true
    deployment-aware: true
    core-pool-size: 8
    keep-alive-seconds: 0
    lock-time-in-millis: 10000
    max-jobs-per-acquisition: 16
    max-pool-size: 8
    queue-capacity: 128
    wait-time-in-millis: 1000
    max-wait: 60000
    backoff-time-in-millis: 0
    max-backoff: 0
    backoff-decrease-threshold: 100
    wait-increase-factor: 2
  # Datasource
  database:
    schema-update: true
    type: postgres
    jdbc-batch-processing: true
  # Eventing
  eventing:
    execution: true
    history: true
    task: true
  # JPA
  jpa:
    enabled: true
    #persistence-unit-name: JPA persistence unit name
    close-entity-manager: true
    handle-transaction: true

MaximMonin · October 12, 2021, 8:01pm

As far as i know camunda recommends sync replication, instead async.
In your case async is used (i use it too), but i dont use replica db for read operation as you do.
I think read operations with 1-6 sec delay will gives engine wrong info about commited transactions, so i am not sure this can work at all stable.

I tried cockroach but there are other problem, all clustered db use Serializable or Repeatable reads isolation levels. they cannot work in read commited mode. And it is fundamental problem - camunda cannot norm work with db cluster.

mind · October 12, 2021, 9:15pm

Thanks for answer!

I’m trying to get more tests with synchronous replication on this week. And yes, my delay is really bad idea: camunda work with it not perfectly - strange errors and bad performance in a little load (around 10 processes starts everyday with maximal deep of subprocesses 5, plus few parallels; didn’t heavy and not complex).
I’m think synchronous replication is too slow (any write query must wait to refresh all slave nodes as result X+max(slave_lag) where X is time to execute query in master node), but it’s is better then non-controllable static delay.

system · January 30, 2024, 1:28pm