How to performance tune ExternalTaskClient usage?

Is there any documentation on how to maximize performance when using the camunda-external-task-client-java library [0]? We are trying to find a balance between responsiveness (tasks are picked up and completed quickly), database load (not too many big queries on the DB), throughput (keeping up when a large number of process instances are being started), and service CPU usage. We’re also unclear about how the library works with multiple stages/topics, and whether they should use individual ExternalTaskClient objects, share them, thread pool them, etc., particularly given our somewhat large number of stages (we have 17 different stages that are executed in external-task implementations).

Initially we used a single ExternalTaskClient that all of our stages subscribed through, with a low maximum backoff (50ms between requests). We were trying to maximize responsiveness, but this placed quite a high load on our DB and couldn’t keep up when we needed higher task throughput.
We then switched to a couple of separate ExternalTaskClient objects, because we thought the library might not efficiently “share” one client across multiple topic subscriptions. Some of these ExternalTaskClients had a higher or lower MaxTasks setting, depending on how quickly the work would complete. This helped, but still didn’t give us the throughput we expected in stages that shared an ExternalTaskClient with other stages.
That led us to a model where each stage uses its own ExternalTaskClient, but we’re finding that this results in quite high CPU usage for our service.

The only documentation I’ve been able to find for performance tuning this is here [1], and it’s extremely limited. I can’t tell if the TopicSubscriptionManager [2] is safe or performant to share among multiple stages each subscribed to different topics. Is there any further documentation with details about what the cost/benefit tradeoffs are of adjusting different settings in this library?

[0] GitHub - camunda/camunda-external-task-client-java: This codebase was merged with https://github.com/camunda/camunda-bpm-platform. Only some maintenance branches might still be active.
[1] External Task Client | docs.camunda.org
[2] https://github.com/camunda/camunda-external-task-client-java/blob/master/client/src/main/java/org/camunda/bpm/client/topic/impl/TopicSubscriptionManager.java

@Megan_Lovett I’m experimenting in this same area.

I deployed a basic test workflow consisting entirely of external tasks to measure throughput. Each workflow has 8 external tasks and each task sleeps for 200ms, so if I ran it in memory I’d expect a workflow to take close to 2s (8 × 200ms = 1.6s). In today’s testing I ran around 4k workflows, and each workflow took ~30s on average. I’m creating workflows at about 30 per minute, and my team needs to support at least 100x this. During these tests our DB CPU reaches 90-100% relatively quickly, which would explain much of the slowdown.

If anyone has some guidance or thoughts, it would be much appreciated.

I’m experimenting with the number of clients and the number of tasks that they pull. The test above had clients that would pull 1 task at a time and I have around 20 of them polling with exponential backoff.
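To put the numbers from this test side by side, here is a rough back-of-the-envelope sketch. It assumes each client slot handles one task at a time and ignores all fetch, lock and complete overhead, so the "supply" figure is only a theoretical upper bound; the `capacity` function and its field names are made up for illustration.

```javascript
// Back-of-the-envelope capacity check for an external-task worker setup.
// Assumption: one task per slot at a time, zero engine/network overhead.
function capacity({ tasksPerWorkflow, taskDurationMs, workflowsPerMinute, clients, maxTasksPerClient }) {
  // Best-case duration of one workflow if its tasks run back to back.
  const idealWorkflowMs = tasksPerWorkflow * taskDurationMs;
  // External tasks created per second by the incoming workflows.
  const demandTasksPerSec = (workflowsPerMinute * tasksPerWorkflow) / 60;
  // Upper bound on tasks the workers could finish per second.
  const supplyTasksPerSec = clients * maxTasksPerClient * (1000 / taskDurationMs);
  // Minimum number of concurrently busy slots needed to keep up with demand.
  const slotsNeeded = Math.ceil((demandTasksPerSec * taskDurationMs) / 1000);
  return { idealWorkflowMs, demandTasksPerSec, supplyTasksPerSec, slotsNeeded };
}

const base = { tasksPerWorkflow: 8, taskDurationMs: 200, clients: 20, maxTasksPerClient: 1 };
const today = capacity({ ...base, workflowsPerMinute: 30 });
const target = capacity({ ...base, workflowsPerMinute: 3000 }); // 100x

console.log(today);
console.log(target);
```

Under these assumptions the current load is about 4 tasks per second against a theoretical worker ceiling of 100 per second, which suggests the workers themselves are not the limit in this test and is consistent with the DB being the bottleneck. At 100x the demand is about 400 tasks per second, which would need at least 80 busy slots for 200ms tasks.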

Try enabling long polling and reducing the polling interval to a very low value (instead of the 300ms default). I used 5ms. It doesn’t actually spam the engine: it makes one REST call per long-polling interval (I used 60 sec).

// For fast parallel processing it is critical to reduce the polling interval to a low value.
// Client and logger come from the camunda-external-task-client-js package;
// url, workerId, loglevel, longPolling, maxTasks, tasktype, basicAuthentication and router are defined elsewhere.
// Create a Client instance with custom configuration.
const config = { baseUrl: url, workerId: workerId, use: logger.level(loglevel), asyncResponseTimeout: longPolling, lockDuration: 50000,
  maxTasks: maxTasks, interval: 5, autoPoll: false, interceptors: [basicAuthentication /*, sslValidation */] };
const client = new Client(config);
const topicSubscription = client.subscribe(tasktype, {}, async function ({task, taskService}) {
  // console.log(JSON.stringify(task));
  await router(task, taskService);
});
// autoPoll is false, so polling has to be started explicitly.
client.start();

A gap of 10-30ms between consecutive external task calls with async before = true is normal: the Camunda engine uses a large SQL query to find the next task to execute, and it also takes time to commit every input/output variable and to fetch the new external task.

I also used a router with a single subscription, routing by task attributes:
const { processInstanceId, processDefinitionKey, activityId } = task;
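A minimal sketch of what such a router could look like, assuming routing on `activityId` only. The handler names (`validateOrder`, `chargeCard`) and the `handlers` table are hypothetical and not from the original setup; `taskService.complete` and `taskService.handleFailure` are the JS client's task service calls.

```javascript
// Hypothetical handler table keyed by activityId; the names are illustrative only.
const handlers = new Map([
  ["validateOrder", async (task) => { /* do the actual work here */ }],
  ["chargeCard", async (task) => { /* do the actual work here */ }],
]);

// Dispatches a fetched task to the handler registered for its activity.
async function router(task, taskService) {
  const { processInstanceId, processDefinitionKey, activityId } = task;
  const handler = handlers.get(activityId);
  if (!handler) {
    // No handler registered: report a failure without retries so an incident is visible.
    await taskService.handleFailure(task, {
      errorMessage: `No handler for ${processDefinitionKey}/${activityId} (instance ${processInstanceId})`,
      retries: 0,
      retryTimeout: 0,
    });
    return;
  }
  try {
    await handler(task);
    await taskService.complete(task);
  } catch (err) {
    // Handler threw: report the error back to the engine instead of losing the task.
    await taskService.handleFailure(task, { errorMessage: err.message, retries: 0, retryTimeout: 0 });
  }
}
```

With a single subscription the client issues one fetch-and-lock per poll instead of one per topic, at the cost of every task type sharing the same `maxTasks` and `lockDuration` settings.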

P.S. This applies to the JS client. The Java client has a long-polling interval but no internal interval setting; it uses a backoff strategy instead.
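For intuition on how a backoff strategy affects idle polling, here is a small simulation of a generic exponential backoff. It is not the Java client's code; the `createExponentialBackoff` helper and the default values (500ms initial wait, factor 2, 60s cap) are assumptions for illustration, so check the client documentation for the real defaults.

```javascript
// Generic exponential backoff: the wait grows while fetches come back empty
// and resets to zero as soon as a fetch returns tasks.
function createExponentialBackoff({ initTime = 500, factor = 2, maxTime = 60000 } = {}) {
  let level = 0; // number of consecutive empty fetches
  return {
    // Call after each fetch with the number of tasks it returned.
    reconfigure(fetchedCount) {
      level = fetchedCount === 0 ? level + 1 : 0;
    },
    // Milliseconds to wait before the next fetch.
    calculateBackoffTime() {
      if (level === 0) return 0;
      return Math.min(initTime * Math.pow(factor, level - 1), maxTime);
    },
  };
}

// Four empty fetches, then one that returns tasks.
const backoff = createExponentialBackoff();
const waits = [0, 0, 0, 0, 3].map((fetched) => {
  backoff.reconfigure(fetched);
  return backoff.calculateBackoffTime();
});
console.log(waits); // [500, 1000, 2000, 4000, 0]

// With the cap lowered to 50ms, an idle client never waits longer than 50ms.
const aggressive = createExponentialBackoff({ maxTime: 50 });
aggressive.reconfigure(0);
console.log(aggressive.calculateBackoffTime()); // 50
```

A 50ms cap means an idle client can issue up to 20 fetch requests per second if the requests return immediately (that is, without long polling), and that rate multiplies by the number of clients.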
