Prometheus Metrics: Process Engine Plugin (contribution)

Sure. What I have is a Spring Boot 2.2.7 app with an embedded Camunda 7.13 and our own library versions (according to our factory standards). Right now it's just running locally.
I forgot to add more detail about the error I'm getting.

  • What went wrong:
    Execution failed for task ':compileJava'.

    Could not resolve all files for configuration ':compileClasspath'.
    Could not resolve commons-logging:commons-logging:1.2.
    Required by:
        project : > org.camunda.connect:camunda-connect-http-client:1.4.0 > org.apache.httpcomponents:httpclient:4.5.12
        project : > com.github.StephenOTT:camunda-prometheus-process-engine-plugin:v1.8.0 > org.apache.httpcomponents:fluent-hc:4.5.12
    Module 'commons-logging:commons-logging' has been rejected:
        Cannot select module with conflict on capability 'logging:jcl-api-capability:0' also provided by [org.springframework:spring-jcl:5.2.6.RELEASE(compile)]
    Could not resolve org.springframework:spring-jcl:5.2.6.RELEASE.
    Required by:
        project : > com.github.StephenOTT:camunda-prometheus-process-engine-plugin:v1.8.0 > org.springframework:spring-core:5.2.6.RELEASE
    Module 'org.springframework:spring-jcl' has been rejected:
        Cannot select module with conflict on capability 'logging:jcl-api-capability:0' also provided by [commons-logging:commons-logging:1.2(compile)]

Do you need more information?

Regards,
Diego

From a quick look, it likely comes from https://github.com/StephenOTT/camunda-prometheus-process-engine-plugin/blob/master/pom.xml#L124-L135, with a major-version incompatibility (just a quick guess).

I guess so too.
I'll investigate the error a little more. If I find a solution, can I make a PR to your repo?

Regards,
Diego

Sure. A quick test would be to update those deps and "hope" there are no breaking changes :slight_smile:
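(For reference, Gradle also offers a consumer-side way to resolve exactly this kind of capability conflict without touching the plugin's deps. A minimal sketch, assuming Gradle 5.6+; the selected candidate must be one of the modules named in the error:)

configurations.all {
    resolutionStrategy.capabilitiesResolution.withCapability('logging:jcl-api-capability') {
        // Prefer Spring's JCL bridge over commons-logging.
        select('org.springframework:spring-jcl:5.2.6.RELEASE')
        because 'spring-jcl replaces commons-logging in Spring Boot apps'
    }
}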

It wasn’t necessary :slight_smile:
I just had to add these lines to build.gradle:

configurations.all {
    // Exclude the conflicting transitive modules; compatible versions
    // are already pulled in by other dependencies.
    exclude module: 'httpclient'
    exclude module: 'commons-logging'
}

Those modules are already included via other dependencies.
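(To double-check the result, running gradle dependencies --configuration compileClasspath before and after adding the excludes shows exactly which dependencies pull in httpclient and commons-logging.)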

Regards,
Diego

Hi @StephenOTT, after a couple of weeks I was able to adapt your implementation to Spring Boot 2.2.6 and Camunda 7.13.
If you're interested, I could make a PR to your repo. But first, we should talk about the best way to organize the code according to your standards.

Cheers,
Diego


If you want to post it in the repo as a WIP (work in progress) PR, I would be interested to take a look.

Hello @StephenOTT
Thanks for building this plugin. We have integrated it with our Camunda setup and it has helped us monitor our workflow metrics much better. Are there any more features coming, such as alert rules and alert configurations?

@Sandeep_Yalamarthi can you give me some examples of features you are looking for?

Most alerts I had imagined would be Prometheus-specific configurations per implementation.

@StephenOTT We run an internal PaaS application with the process engine embedded in the application, on k8s in a single pod. We observed the pod/application crashing frequently due to a high number of asynchronous jobs (1,500,000) created in the background, which also means a high number of incidents, caused by a faulty BPMN/workflow. To avoid these kinds of DDoS attacks :stuck_out_tongue: it would be better if we could get an alert when the number of background jobs crosses a certain threshold. The same applies to any metric that affects application/engine health.

You should be able to set a threshold in Grafana. You should not need anything special. What is missing?
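(As a sketch, a threshold like that can also live on the Prometheus side as an alerting rule. The metric name below is one of the plugin's gauges, listed later in this thread; the group name, alert name, threshold, and duration are all illustrative:)

groups:
  - name: camunda-engine-health
    rules:
      - alert: CamundaOpenIncidentsHigh
        expr: camunda_open_incidents_count > 1000   # choose your own threshold
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Camunda open incidents above threshold"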

A new plugin has been developed that is a replacement:

The new plugin leverages Micrometer and Spring Boot Actuator.

You can use any of the supported Micrometer monitoring systems.
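(For anyone wiring this up, the Spring Boot side is the stock Micrometer/Actuator setup. A minimal sketch; these artifact and property names are standard Spring Boot, not specific to the plugin:)

// build.gradle: pull in the Prometheus registry for Micrometer
dependencies {
    implementation 'io.micrometer:micrometer-registry-prometheus'
}

# application.yml: expose the scrape endpoint through Actuator
management:
  endpoints:
    web:
      exposure:
        include: prometheus   # metrics served at /actuator/prometheus

Prometheus can then scrape /actuator/prometheus; swapping in another Micrometer monitoring system is just a different registry dependency.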


We are running Camunda 7.12 in production. A lot of our BPMNs are modelled with connector tasks making HTTP calls to other services. To monitor all these business metrics we tried to implement the earlier plugin,

StephenOTT/camunda-prometheus-process-engine-plugin, with the default scrape frequency of 5s. But it was causing high CPU utilisation on the DB server and we had to disable the plugin.

Will integrating the latest plugin solve our issue? Have there been any reports of performance issues when it is integrated in a heavy-load application with a huge amount of data in the history database?
What are the ideal configurations to run this plugin?

If you're running the queries every 5 seconds, then you are querying the database every 5 seconds. High CPU load is expected if you are querying for large amounts of data.

Which specific metrics are you running?

Almost all of the metrics provided by the initial plugin. This is the list:

  1. camunda_metric_activity_instance_start
  2. camunda_metric_activity_instance_end
  3. camunda_metric_executed_decision_elements
  4. camunda_metric_job_successful
  5. camunda_metric_job_failed
  6. camunda_metric_job_acquisition_attempt
  7. camunda_metric_job_acquired_success
  8. camunda_metric_job_acquired_failure
  9. camunda_metric_job_execution_rejected
  10. camunda_metric_job_locked_exclusive
  11. camunda_process_definition_stats_instance_count
  12. camunda_message_event_subscription_count
  13. camunda_signal_event_subscription_count
  14. camunda_compensation_event_subscription_count
  15. camunda_conditional_event_subscription_count
  16. camunda_open_incidents_count
  17. camunda_resolved_incidents_count
  18. camunda_deleted_incidents_count
  19. camunda_active_process_instance_count
  20. camunda_active_user_tasks_count
  21. camunda_active_unassigned_user_tasks_count
  22. camunda_suspended_user_tasks_count
  23. camunda_active_timer_job_count
  24. camunda_suspended_timer_job_count

Although for a few of the metrics I have done some customisations to add an extra tenantId label. An example Groovy snippet for a customised counter metric is below.


// Collect the distinct tenant IDs across all deployed process definitions.
// (.unique() avoids querying the same tenant once per definition.)
List<String> tenantsList = ProcessEngines.getDefaultProcessEngine()
        .getRepositoryService()
        .createProcessDefinitionQuery()
        .list()
        .collect { it.tenantId }
        .unique()

// For each tenant, count its SIGNAL event subscriptions and publish the
// value with tenantId and engineName as extra labels. processEngine,
// counter, and engineName are assumed to come from the plugin's script
// context.
tenantsList.each { tenantId ->
    long count = processEngine.getRuntimeService()
            .createEventSubscriptionQuery()
            .eventType("SIGNAL")
            .tenantIdIn(tenantId)
            .count()
    counter.setValue(count, Arrays.asList(tenantId, engineName))
}
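One design note on the snippet above: deduplicating the tenant list matters because the original stream produced one entry per process definition, so a tenant with many definitions was queried repeatedly on every scrape. Even deduplicated, the loop still runs one engine query per tenant per scrape, so its cost grows with the number of tenants.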

I would recommend you disable 1, 2, and 11.

Always consider what level of detail you actually need visibility on. Each of those 24 items is a query being executed, some with 1+N scenarios such as #11, where it gets a list of definitions and then does further lookups for each one. That can be a lot of data to process, especially given you are executing every 5 seconds.


Thanks for the suggestions @StephenOTT. For now we have increased the scrape interval to 15 mins and disabled 1, 2, 11 and a few of the custom metrics, and things seem to have stabilised. I am also wondering if there is any other way of getting the full telemetry of the engine without the DB-scraping approach. Can Camunda push metrics to Prometheus collectors while creating/invoking the resources itself?