Process Instances completely disappeared, where to start the investigation?

Hello there,

This morning four instances of a process vanished from Camunda. I was basically watching the Cockpit as it happened. Neither of the two people with the necessary privileges to delete an instance deleted them. Via the API we looked at history/process-instance and the process-instance logs, but the process instances in question are gone entirely.
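For anyone retracing these checks, here is a minimal sketch of the two REST queries involved: the historic process instances for a definition, and the user operation log filtered for process instance deletions. The base URL (`http://localhost:8080/engine-rest`) and the definition key `processA` are assumptions for illustration, not our real values. As far as I know, the user operation log only records operations performed by an authenticated user, so an empty result there does not rule out a deletion by other means.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class MissingInstanceQueries {

    // Assumption: default REST base path of a Camunda 7 distribution; adjust host and port.
    static final String BASE = "http://localhost:8080/engine-rest";

    // Joins URL-encoded query parameters onto a REST path.
    static URI build(String path, Map<String, String> params) {
        String query = params.entrySet().stream()
            .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
            .collect(Collectors.joining("&"));
        return URI.create(BASE + path + (query.isEmpty() ? "" : "?" + query));
    }

    // Historic process instances of one definition (running and finished).
    static URI historicInstances(String processDefinitionKey) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("processDefinitionKey", processDefinitionKey);
        return build("/history/process-instance", params);
    }

    // User operation log entries for deleted process instances.
    static URI deleteOperations() {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("operationType", "Delete");
        params.put("entityType", "ProcessInstance");
        return build("/history/user-operation", params);
    }

    public static void main(String[] args) {
        // "processA" is a hypothetical definition key.
        System.out.println(historicInstances("processA"));
        System.out.println(deleteOperations());
    }
}
```

The printed URLs can be opened with any HTTP client that has access to the engine. If both queries come back empty for the affected instances, the rows were most likely removed without going through the normal delete operation.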

In fact, this is the second time it has happened; last time it was with a different process definition.

The process definitions and deployments are also still there, but the instances are gone.

As far as we can tell, the process definitions are fine. The instances from the two processes had different states, i.e. the tokens were at different tasks. As far as we can tell, the instances were waiting for further input and did not disappear directly after some interaction within the instances.

Unfortunately, due to the structure of our organization, it will take us some time to get more system logs to analyze. So I was hoping that somebody here has an idea what the cause could be. Some error with the PostgreSQL database?

What kind of process is it? User tasks, system tasks or both?
How are your services implemented? Java, external tasks, script?
How is your history configured? Full, auto or off?

Do you have a clustered setup for the engine or the database?
What version of the engine are you using?

Sorry for getting back to you so late, I’ll try to answer sooner next time.

The process, let's call it A, is made up of one service task and receiving message events. Some of the message events are connected via an event-based gateway. The process will be expanded in the future, but for now it contains only these elements.

Another process B is sending process A messages from time to time to advance the process. For now, the communication is unidirectional, i.e. only from B to A.

We have checked the process definitions and, to the best of our knowledge, both processes are well defined. Process B has been running for quite some time, and process A is quite simple and easy to check.

We implement our services with Java classes or use external tasks with microservices.

Our history level is FULL.

We run 7.12. Our setup encompasses two instances, a production and a staging system, each running in its own Docker container. We use nginx as a reverse proxy and PostgreSQL databases. Every night we shut down both systems, copy the Camunda database from production to staging, and reboot the systems. Due to our suboptimal setup, we also have to reboot the systems whenever we deploy new processes.

Meanwhile, we got ourselves some catalina logs. Unfortunately, due to our suboptimal setup, we lost all older logs, so we can only work with some logs from the incident…We are working with our IT department to rectify that, but…well…

I attached the logs to this post.

Now it looks to me as if the engine is trying to do a batch operation on the database to insert 5 execution entities, one of the 5 inserts fails, and the transaction is rolled back. And somehow this deleted the process instances. But I'm totally lost as to how.

Interestingly, the referenced task “Task_13rld24” is part of a sub-process of process B, the process that sends messages to process A, whose instances have gone missing. We can’t find the task instances of Task_13rld24 in the history database, nor the execution environments of these tasks.
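In case somebody wants to repeat that lookup, this is roughly how it can be done over REST instead of directly in the database. It is only a sketch: the base URL is an assumption, and the two endpoints are the standard Camunda 7 ones for historic activity instances and runtime executions, both filtered by activity id.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class SubProcessTraceRequests {

    // Assumption: default REST base path; adjust host and port for your setup.
    static final String BASE = "http://localhost:8080/engine-rest";

    // Historic activity instances recorded for the given activity id.
    static HttpRequest historicActivityInstances(String activityId) {
        return get("/history/activity-instance?activityId=" + activityId);
    }

    // Runtime executions currently waiting at the given activity id.
    static HttpRequest runtimeExecutions(String activityId) {
        return get("/execution?activityId=" + activityId);
    }

    private static HttpRequest get(String pathAndQuery) {
        return HttpRequest.newBuilder(URI.create(BASE + pathAndQuery))
            .header("Accept", "application/json")
            .GET()
            .build();
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest[] requests = {
            historicActivityInstances("Task_13rld24"),
            runtimeExecutions("Task_13rld24")
        };
        for (HttpRequest request : requests) {
            // Needs a reachable engine; an empty JSON array means nothing was found.
            HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(request.uri() + " -> " + response.statusCode());
            System.out.println(response.body());
        }
    }
}
```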

So there might be exceptions here, but in the end we are not even sure whether these exceptions are related to our missing process instances. And if they are, how?

So far, we have disconnected the sending message events from process B and stopped using process A completely, as we don’t understand what is happening.

catalina logs mod.txt (89.7 KB)

We solved the issue by deleting the subprocess in process B that contained Task_13rld24 and moving its tasks up into process B directly. We also un-deployed the subprocess and deleted all instances in which it was running.
The issue is now solved; however, we still do not know how it arose in the first place. Well…
