I’ve been digging deep into MySQL performance tuning and monitoring tools, and there’s a huge swath of possible changes that can be made depending upon the configuration.
My #1 objective in testing is to determine whether or not we have literally run out of server I/O capacity. If that is the case, no amount of tuning is going to help. One challenge here is that observing what’s happening in detail on both the Camunda and database servers, and then correlating all of it, imposes its own overhead because of the volumes we would be pushing.
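As a first gate on that question, here’s the sort of crude check I have in mind, sketched in Java. The 90% utilization threshold and the “sustained for 80% of the window” rule are my own assumptions for illustration, not established limits; the samples themselves would come from something like iostat’s %util or MySQL Enterprise Monitor, not from this code.

```java
import java.util.List;

public class IoSaturationCheck {
    // Treat the device as saturated when utilization stays above this
    // threshold for most of the observation window. Both numbers are
    // illustrative assumptions, not established limits.
    static final double UTIL_THRESHOLD = 0.90;
    static final double SUSTAINED_FRACTION = 0.80;

    // samples: per-interval disk utilization (0.0 - 1.0), e.g. from iostat %util
    static boolean looksSaturated(List<Double> samples) {
        long busy = samples.stream().filter(u -> u >= UTIL_THRESHOLD).count();
        return !samples.isEmpty() && (double) busy / samples.size() >= SUSTAINED_FRACTION;
    }

    public static void main(String[] args) {
        List<Double> calm = List.of(0.35, 0.40, 0.55, 0.92, 0.38);
        List<Double> pegged = List.of(0.97, 0.99, 0.95, 0.98, 0.91);
        System.out.println(looksSaturated(calm));   // false: one spike is not saturation
        System.out.println(looksSaturated(pegged)); // true: sustained near-100% utilization
    }
}
```

The point of requiring sustained high utilization is to avoid mistaking a brief burst for genuine exhaustion; if this says “saturated” over a long window, tuning won’t help and hardware or architecture will.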
One final top-level thought that has some traction on our team is that Camunda has not indexed as much as it should. This is not based upon any “smoking gun” evidence, but MySQL Enterprise Monitor is constantly complaining about inefficient index use. In a previous version of Camunda (7.4.2 was the last we tried this on), we were getting awful Camunda Cockpit performance on a large instance, so we tracked down the tables used, added indices, and performance was 10X better. This would explain some, but not all, of our issues. However, I’ve always operated under the assumption that Camunda has already optimized the schema (and the related SQL scripts that create it), but perhaps they never anticipated the volumes we are running.
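For what it’s worth, the index hunt back on 7.4.2 was mechanical: capture the slow queries Cockpit issues, run EXPLAIN on them, and flag anything doing a full table scan over a big table. A rough sketch of that last step in Java — the row shape mirrors EXPLAIN’s table/type/rows columns, and the 10,000-row cutoff and sample tables are illustrative, not a rule:

```java
import java.util.List;

public class IndexCandidateFinder {
    // One EXPLAIN output row, reduced to the columns we care about.
    record ExplainRow(String table, String accessType, long rowsExamined) {}

    // Flag tables that EXPLAIN reports as full scans ("ALL") over a
    // non-trivial number of rows; these are candidates for a new index.
    // The 10,000-row cutoff is an arbitrary illustration.
    static List<String> indexCandidates(List<ExplainRow> rows) {
        return rows.stream()
                .filter(r -> "ALL".equals(r.accessType()) && r.rowsExamined() > 10_000)
                .map(ExplainRow::table)
                .toList();
    }

    public static void main(String[] args) {
        List<ExplainRow> plan = List.of(
                new ExplainRow("ACT_HI_ACTINST", "ALL", 4_800_000),  // full scan of a huge table
                new ExplainRow("ACT_RU_EXECUTION", "ref", 12));      // already using an index
        System.out.println(indexCandidates(plan)); // [ACT_HI_ACTINST]
    }
}
```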
Here are a couple of details on our setup:
- Primary server acts as a REPLICATION MASTER to another server on the same subnet. Replication is asynchronous.
- The database server is used only for Camunda and nothing else; Camunda has complete control of the database. The only things accessing the database are the Camunda BPMN engine/REST API, the Camunda administration GUI web application, and the MySQL Enterprise Monitor Agent.
- We do not use an XA datasource.
- We do use a largish connection pool (minimum 100 connections) and everything uses SSL (this is required).
- Transactions tend to be extremely active, in that every task is generally waiting for something (usually an external web service call) to complete and will then move on to the next activity. There are no intentionally long-running processes (e.g. human interactive tasks) currently. Everything executes as fast as the servers will allow.
Here’s a sample of our datasource configuration, which I have modified to remove proprietary information and to make it easier to read. We use the commercial MySQL Connector/J driver. We also go through a local load balancer that supports a vURL, which points only to a single database server; the load balancer acts as a “switch” in the event rapid failover is required.
<datasource jta="true" jndi-name="java:jboss/datasources/ProcessEngine" pool-name="ProcessEngine" enabled="true" use-java-context="true" use-ccm="true" statistics-enabled="true">
    <!-- connection-url, driver, pool, security, and SSL settings removed -->
</datasource>
<driver name="mysql" module="com.mysql"/>
One question I can’t figure out is: does each of the core Job Executor threads actively “poll” the database for a task? If that were the case, then setting a high “core-thread” count would result in useless chatter on the database server. In our case, if we get a backlog of 100,000 active processes, each of which actually wants to do something or is actively waiting for its task to complete, then where do we go?
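What I would hope for is the pattern below: a single acquisition loop batches jobs out of the database and hands them to the worker pool, so that raising the core-thread count adds execution concurrency without adding query chatter. To be clear, this is just a sketch of that pattern in plain java.util.concurrent, not Camunda’s actual implementation; the in-memory stand-in for the job table and the batch size of 10 are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class SingleAcquirerSketch {
    // Simulates one acquisition loop feeding a pool of worker threads.
    // Returns {jobs executed, number of "polls" against the job table}.
    static int[] run(int totalJobs, int batchSize, int workerThreads) {
        BlockingQueue<Integer> jobTable = new LinkedBlockingQueue<>(); // stand-in for the job table
        for (int i = 0; i < totalJobs; i++) jobTable.add(i);

        ExecutorService workers = Executors.newFixedThreadPool(workerThreads);
        AtomicInteger polls = new AtomicInteger();
        AtomicInteger executed = new AtomicInteger();

        while (true) {
            polls.incrementAndGet();                // one acquisition query per loop
            List<Integer> batch = new ArrayList<>();
            jobTable.drainTo(batch, batchSize);     // grab up to batchSize jobs at once
            if (batch.isEmpty()) break;             // backlog drained; real code would back off, not exit
            for (int job : batch) workers.execute(executed::incrementAndGet); // workers never touch the "database"
        }
        workers.shutdown();
        try {
            workers.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new int[]{executed.get(), polls.get()};
    }

    public static void main(String[] args) {
        int[] r = run(50, 10, 8);
        System.out.println("executed " + r[0] + " jobs with only " + r[1] + " polls");
    }
}
```

With 50 jobs and a batch size of 10, eight worker threads execute everything while the database sees only six queries (five with results, one final empty poll). If instead every worker thread polled independently, query volume would scale with thread count.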
One challenge we face is the classic “network engineering” problem of how much capacity to build for peak-hour volume. To be sure, it depends upon your SLA, but economically one typically wouldn’t build for volumes that occur only 1% of total operation time unless an SLA dictated it. If we need truly deterministic behavior, like you would expect from a real-time operating system or application, then we would need a very fine-grained understanding of the performance of every server element of every task, and then establish a performance envelope around that.
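To make that tradeoff concrete, the usual move is to size for a chosen percentile of observed load rather than the absolute peak. A toy example of the arithmetic, using a nearest-rank percentile and invented transaction rates (none of these numbers are from our actual measurements):

```java
import java.util.Arrays;

public class PercentileSizing {
    // Nearest-rank percentile: smallest sample value such that at least
    // p% of all samples are at or below it.
    static double percentile(double[] samples, double p) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        // Invented hourly transaction rates: mostly ~1,000/s with one rare spike.
        double[] tps = {950, 1000, 980, 1020, 990, 1010, 970, 1005, 995, 985,
                        1015, 975, 1025, 965, 1030, 960, 1040, 955, 1045, 5000};
        System.out.println("p99 = " + percentile(tps, 99)); // 5000.0 — sizing to this buys the spike
        System.out.println("p95 = " + percentile(tps, 95)); // 1045.0 — sizing target ignoring the rare spike
    }
}
```

The gap between those two numbers is exactly the economic question: whether the SLA justifies roughly 5X the capacity to cover an event that happens 1% of the time.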
My gut tells me that where we’re headed is a truly clustered database that provides horizontal (add-a-server) scalability. The problem with most “clustered” databases is that they do not actually behave as a single, unified instance (i.e., a shared-everything design), but rather as a group of individual instances that use a high-performance synchronization process to keep each other in sync. Camunda’s READ COMMITTED isolation level requirement, which I completely understand, imposes additional challenges, particularly when a node in the cluster gets slightly behind. In that scenario, a query sent to the lagging node could return the value of a committed transaction that does not reflect the current value elsewhere in the cluster. READ COMMITTED only guarantees that you get results from committed transactions and never “dirty” reads; it does not guarantee that the value of a particular row on a specific node of the cluster is in fact the latest value.
The consequence is that if Camunda or your process logic depends upon “serial” execution of database updates based upon the finite, unique time they occurred (no two events truly occur at the same time, though the resolution of timekeeping on the server dictates a practical limit to how confidently you can “order” transactions), then you could have problems. In essence, the state of any row in the database would have to be the same on every server at the same time, which is not possible with clustering technology like Galera or MySQL’s new Group Replication. In fact, the only MySQL technology I know of that provides a truly monolithic, scalable instance is MySQL Cluster (NDB), which we’ve never used.
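To illustrate the stale-read concern, here’s a minimal simulation — not modeled on any particular clustering product. Each node only ever serves committed values, so READ COMMITTED is honored on each node individually, and yet a read routed to the lagging node returns an older committed value than the primary would.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class StaleCommittedRead {
    // A deliberately tiny "node": it only ever stores committed values,
    // so any read from it satisfies READ COMMITTED in isolation.
    static class Node {
        long committedValue;
    }

    // Returns {what the primary reads, what the lagging replica reads}.
    static long[] demo() {
        Node primary = new Node();
        Node replica = new Node();
        Deque<Long> replicationStream = new ArrayDeque<>(); // async, applied in order

        // Two committed writes on the primary...
        primary.committedValue = 1; replicationStream.add(1L);
        primary.committedValue = 2; replicationStream.add(2L);
        // ...but the replica has only applied the first event so far.
        replica.committedValue = replicationStream.poll();

        return new long[]{primary.committedValue, replica.committedValue};
    }

    public static void main(String[] args) {
        long[] r = demo();
        System.out.println("primary sees " + r[0] + ", replica sees " + r[1]);
        // Both answers are committed values, yet they disagree: the replica's is stale.
    }
}
```

Neither node ever exposes an uncommitted value, so no isolation level is violated; the inconsistency is purely a property of routing reads across asynchronously synchronized nodes.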
I’m sure this is going to be a tedious, iterative process. I’ll share as much of my experience, tooling, and results with the community as I can.
As always, thanks for your advice and your contributions to the Camunda community.