Performance - Completion of external tasks

Hey everybody,
I currently have a problem and would like to seek your advice. I really hope you are able to help me or tell me where I took a wrong turn. :slight_smile:

Over the course of the last few weeks and months, my colleagues and I have been working on getting Camunda to run in our use case scenarios. It runs, but we are not at all satisfied with the performance.

To maybe make clear what we are facing:
Let’s say we have a business process with just a single activity.
That business process will always be started a few hundred thousand, if not a few million, times.
That activity is just meant to delegate to a REST-based microservice that can handle a batch request in a matter of seconds, if not milliseconds (we invested a lot of time and effort into getting it that fast).

The activity is marked as an external task with a topic, input / output defined, and so on.

I implemented a simple solution, based on reactive streams that polls the programmatic API of the ExternalTaskService for that specific topic and tries to fetchAndLock a pre-defined number of external tasks.
Individual request objects of each external task are aggregated into a single batch request and a REST-call is sent to the micro-service. The result is then split into individual responses again and corresponding external tasks are completed, one after another, individually, programmatically at the ExternalTaskService.
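To make the shape of that pipeline concrete, here is a minimal, engine-free sketch of the aggregate-then-split step. The `TaskCompleter` interface and the `String` payloads are purely illustrative stand-ins for the Camunda `ExternalTaskService` and the real request/response objects, not actual Camunda API:

```java
import java.util.List;
import java.util.function.Function;

// Illustrative sketch of the batch pattern described above. TaskCompleter and
// the String payloads are hypothetical stand-ins, not the Camunda API.
public class BatchPipeline {

    interface TaskCompleter {
        void complete(String taskId, String result);
    }

    // Aggregate all locked-task payloads into one batch request, call the
    // service once, then complete each task individually with its result.
    static int process(List<String> taskIds, List<String> payloads,
                       Function<List<String>, List<String>> batchCall,
                       TaskCompleter completer) {
        List<String> results = batchCall.apply(payloads);   // one REST call
        for (int i = 0; i < taskIds.size(); i++) {
            completer.complete(taskIds.get(i), results.get(i)); // N completes
        }
        return taskIds.size();
    }
}
```

The key asymmetry this exposes: the batch call happens once, but `complete(...)` fires once per task, each a full engine round-trip in the real setup.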

So far so good. Luckily, polling seems okay and does not cost that much performance (as far as I have observed). Sadly, completing each individual external task takes ages. On a mediocre laptop it takes 10 to 30 seconds to complete around 500 to 600 external tasks with an H2 in-memory DB, and even longer with a production-grade Oracle database in the background.
While that single endpoint of our system (the microservice I talked about) takes a few seconds at most to make significant computations for millions of customers on that mediocre laptop, the performance of the task completion is just not something we can accept. I understand that state has to be persisted, and so on, and I am okay with that. I just would not have thought that we'd see such a significant drop in performance by taking that route.

The production scenario will most likely involve more complex diagrams and 1-to-n aggregations across many more activities. While not exclusively our use case, we still want to use that pattern (aggregation and batch processing) a lot. This leads to the problem that subsequent tasks have to wait until the previous activity has completed for all customers, or we have to take smaller batches.

Are external tasks the wrong tool for that use case? Did I or did we miss some kind of configuration parameters?
I’m really looking forward to your replies. Thank you in advance for your investment of time! :slight_smile:


Interesting issue and questions…
You’re referring to “reactive streams” - this is RxJava… right?

I’m also using RxJava and am hoping to push Camunda to high capacity.

Here are my observations and recommendations…

The scenario:
1) A large number of process instances start
2) The process model (or type) contains an external task.
3) You then have a separate application dedicated to the responsibility of “fetch-and-lock”-ing n external-task instances. We’ll call this the “external task driver”.

Since you mentioned reactive streams, I’ll assume you have an “observable” and a “repeatWhen” emitter - the goal being a polling service pointed at Camunda. This polling service issues the fetch-and-lock request to Camunda’s engine (via REST or Java?).

You then have an Observer keeping track of the results generated/emitted by the observable polling work.
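Stripped of the RxJava machinery, that polling driver boils down to something like the following plain-JDK sketch, where the `Supplier` is a hypothetical stand-in for the actual fetch-and-lock call against Camunda:

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Plain-JDK analogue of the "observable poller": repeatedly fetch a page of
// locked tasks and hand it to a handler until nothing more is returned.
// The Supplier stands in for the real fetch-and-lock call; names are illustrative.
public class Poller {

    static int drain(Supplier<List<String>> fetchAndLock,
                     Consumer<List<String>> handler) {
        int total = 0;
        while (true) {
            List<String> page = fetchAndLock.get();
            if (page.isEmpty()) {
                break;            // nothing left to lock; a real poller would sleep and retry
            }
            handler.accept(page); // batch the page off to the worker
            total += page.size();
        }
        return total;
    }
}
```

In the reactive version, the `while` loop is replaced by the repeating emitter, and the handler is whatever the observer does with each emitted page.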

4) re: “aggregated into a single batch request and a REST-call is sent to the micro-service.”

So… why use an “external task”? It seems to be implemented as designed - meaning it’s just there keeping track of external work effort - right? But for high-capacity systems (and given you’ve run into a performance issue), there are options. This external task appears to add unnecessary overhead because of the massive number of fetch-and-lock requests required for the associated n process instances.

Assuming you want to keep track of effort and duration, and maybe there’s some escalation in the process depending on the results of external activities, we have options. But escalation services shouldn’t be reporting on delays caused by their own existence (I’ve actually run into this situation, whereby a report illustrates its own fault in causing the delays it represents - a sort of self-actualizing concern).

In my opinion, I’d drop the “external tasks” from the process model - too much overhead, per your experience, given the capacity requirements. And though I’m not sure where your REST clients live, I’d also keep them outside the delegates, purely because of said capacity constraints. Better to batch these off to a specialized provider tuned for high capacity (meaning you’re reusing connections - shaving milliseconds at every step).
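The connection-reuse point deserves emphasis: one shared HTTP client, created once, pools and reuses connections across all batch calls, whereas a client per request pays the connection setup cost every time. A minimal sketch with the JDK's built-in `java.net.http.HttpClient` (Java 11+); the holder class name is my own:

```java
import java.net.http.HttpClient;
import java.time.Duration;

// One shared HttpClient instance so TCP connections are pooled and reused
// across batch REST calls, instead of being re-established per request.
public class RestClientHolder {

    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .build();

    static HttpClient client() {
        return CLIENT;   // always the same instance, hence the same pool
    }
}
```

The same idea applies to a Jersey client: build it once and share it, rather than constructing one per batch.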

Note: I’m also working on the same sort-of system goals… here’s a basic example:

Streamlining the process… avoiding potential overhead - model refactored:

Other recommendations:

  1. Make sure you run a full purge of deployments, token instances, and history BEFORE each performance/profile test. If you see a bottleneck, verify assumptions. My approach is to fully purge the unit-test system. You may be running into a poorly maintained database (i.e. “very long time for fetch-and-lock” transactions).

Hey Gary,

thank you very much.

To clear things regarding reactive streams up:

The chain goes something like: Observable.generate(...).buffer(...).flatMapIterable(...).subscribe(...)
generate emits periodically, so there is no need for repeatWhen() or repeatUntil().

fetchAndLock is achieved through direct API calls: ProcessEngines.getDefaultProcessEngine().getExternalTaskService().fetchAndLock(...).topic(...).execute();

I omitted subscribeOn and observeOn here (buffer observes on Schedulers.computation() by default, btw). The chain is executed asynchronously on our systems.

The REST call logic goes into flatMapIterable. It’s just a Jersey 2 HTTP client doing REST calls to our service. So all “worker” logic is provided with, and executed within, the same JVM as the Camunda runtime itself.

fetchAndLock does NOT account for a huge loss in performance. What actually costs us a lot of performance is calling
ProcessEngines.getDefaultProcessEngine().getExternalTaskService().complete(...) a few million times.

I will try to take a deeper look at your scenario as soon as possible. :slight_smile:

Real edit:
I can see the same effects when ditching everything regarding RxJava and doing just a simple for loop for all results and then calling complete(). It just takes too long.
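To isolate that effect, the loop can be timed against a stand-in for the complete call; swapping the no-op consumer for the real `externalTaskService.complete(...)` (my hedged assumption of the setup) reproduces the measurement:

```java
import java.util.function.Consumer;

// Tiny harness for timing N per-task "complete" invocations. The Consumer is
// a stand-in for externalTaskService.complete(...); in the real measurement,
// each call is a full engine round-trip with its own persistence work.
public class CompleteLoopTimer {

    static long timeCompletesMillis(int n, Consumer<Integer> complete) {
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            complete.accept(i);  // one completion per iteration
        }
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

With a no-op consumer the loop itself is essentially free, which supports the conclusion that the cost sits entirely in the engine's complete() call, not in RxJava or the loop.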

Hi Oliver,

as I have no experience with Camunda ExternalTasks, your post made me curious about what performance to expect here. I therefore created a very simple test project (camunda-external-task-demo (tag ‘externalTask-bulk-test’)) with a bulkCompleteTest, and I want to share my observations here.

The test manages to complete some hundreds of external tasks per second via the Java API with the H2 database. The value will probably be (noticeably) lower in a real scenario (REST API, full database, variables included, busy process engine, …).
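A quick back-of-the-envelope extrapolation from that rate (assuming roughly 500 completions per second, a figure within the range observed above) shows why this matters at the scale Oliver described:

```java
// Rough extrapolation: seconds needed to complete `tasks` external tasks
// one by one at a sustained rate of `perSecond` completions per second.
// The 500/s rate is an assumption within the range measured above.
public class Throughput {

    static long secondsFor(long tasks, long perSecond) {
        return tasks / perSecond;
    }
}
```

At 500 completions/s, a million process instances need about 2000 seconds, i.e. over half an hour of pure completion work, before any real computation is counted.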

In the end this seems far away from the performance your use case requires. Maybe it would be a good idea if @Camunda provided some hints on what to expect when talking about Camunda and performance - at best not only at the process level but also for basic operations like, e.g., completing external tasks. That way disappointment could be avoided.

Best regards from the other side of the town :wink:
  Franz
