Good details. Since I'm catching up on this today I'll focus on the following...
Tasks are waiting on a reply back from web-service calls (REST or SOAP clients):
Given the term "waiting", I'm assuming these are not asynchronous. In other words, these task implementations (i.e. Java delegates) are synchronous JAX-RS/JAX-WS clients.
It sounds like the BPM engine is invoking synchronous Java web-service client requests against observably long-running SOA operations. Given the throughput requirements, that isn't good - though it's reasonably easy to correct. Beefing up the BPM container and its resources (DB, etc.) doesn't address the core problem either: we'd still be trying to scale out a large collection of synchronous web-service client instances (the task implementations/delegates).
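For reference, here's a guess at the shape of those delegates today. This is only a sketch, assuming an Activiti/Camunda-style JavaDelegate and a JAX-RS 2.0 client (if this is jBPM, the equivalent would be a WorkItemHandler); the endpoint URL, class name and variable names are placeholders:

```java
import javax.ws.rs.client.Client;
import javax.ws.rs.client.ClientBuilder;
import javax.ws.rs.client.Entity;
import javax.ws.rs.core.MediaType;

import org.camunda.bpm.engine.delegate.DelegateExecution;
import org.camunda.bpm.engine.delegate.JavaDelegate;

// Illustrative only: the suspected current (blocking) shape of a task delegate.
public class CreditCheckDelegate implements JavaDelegate {

    @Override
    public void execute(DelegateExecution execution) {
        Client client = ClientBuilder.newClient();
        try {
            // Blocking call: the engine's executor thread is held until the
            // SOA operation answers (or times out). At scale, this blocking
            // wait is the throughput killer described above.
            String response = client
                    .target("http://soa-host/credit-check")   // placeholder URL
                    .request(MediaType.APPLICATION_JSON)
                    .post(Entity.json(execution.getVariable("requestPayload")),
                          String.class);

            execution.setVariable("creditCheckResponse", response);
        } finally {
            client.close();
        }
    }
}
```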
Recommended next steps:
1) Review task-delegate source to determine web-service client architecture (sync/async)
2) Estimate, per capacity requirements, what this means at scale. For example, how many tasks waiting per process instance? How many process instances?
3) If synchronous, and determined to be the cause of our performance bottleneck, refactor to a request/reply EAI pattern or an alternative workaround (see the sketch below). I like messaging here because it provides additional features for recovering state and for message/event reliability (e.g. dead-letter queues, message time-outs, transactions, load-balancing, etc.).
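To make step 3 concrete, here's a minimal sketch of the request/reply refactor, assuming JMS and a Camunda-style message-correlation API. The queue names, JNDI lookups, message name and correlation key are all placeholders, and the two classes would live in separate source files. The point is that the delegate only sends and returns; the process then waits at an intermediate message catch event, so no engine thread sits blocked on the SOA layer:

```java
import javax.annotation.Resource;
import javax.ejb.ActivationConfigProperty;
import javax.ejb.MessageDriven;
import javax.inject.Inject;
import javax.jms.*;

import org.camunda.bpm.engine.RuntimeService;
import org.camunda.bpm.engine.delegate.DelegateExecution;
import org.camunda.bpm.engine.delegate.JavaDelegate;

// 1) Fire-and-return request: the process then waits at a message catch event
//    ("SoaReplyReceived") instead of inside this delegate.
public class CreditCheckRequestDelegate implements JavaDelegate {

    @Resource(lookup = "java:/ConnectionFactory")        // placeholder JNDI names
    private ConnectionFactory connectionFactory;
    @Resource(lookup = "java:/queue/soaRequests")
    private Queue requestQueue;

    @Override
    public void execute(DelegateExecution execution) throws Exception {
        try (JMSContext ctx = connectionFactory.createContext()) {
            TextMessage msg = ctx.createTextMessage(
                    (String) execution.getVariable("requestPayload"));
            msg.setJMSCorrelationID(execution.getProcessBusinessKey()); // correlate reply later
            ctx.createProducer().send(requestQueue, msg);
        }
        // Return immediately - no blocking wait in the engine.
    }
}

// 2) Reply side: an MDB picks up the SOA response and correlates it back into
//    the waiting process instance.
@MessageDriven(activationConfig = {
    @ActivationConfigProperty(propertyName = "destinationLookup", propertyValue = "java:/queue/soaReplies"),
    @ActivationConfigProperty(propertyName = "destinationType", propertyValue = "javax.jms.Queue")
})
public class SoaReplyListener implements MessageListener {

    @Inject
    private RuntimeService runtimeService;

    @Override
    public void onMessage(Message message) {
        try {
            runtimeService.createMessageCorrelation("SoaReplyReceived")
                    .processInstanceBusinessKey(message.getJMSCorrelationID())
                    .setVariable("creditCheckResponse", ((TextMessage) message).getText())
                    .correlate();
        } catch (JMSException e) {
            throw new RuntimeException(e);
        }
    }
}
```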
Additionally, given the need for high capacity, I'd avoid using BPMN timers as the means of escalating on a late service response. Rather than each task instance arming its own BPMN timer event, off-load this requirement to a specialized service: load the event representing the SOA-client wait state into a message and set that message's "time-out" to the response-escalation threshold. I'd only do this because of the capacity (scale) requirements - otherwise I prefer the business-oriented representation a BPMN timer gives you. One way this could look is sketched below.
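Again only a sketch, under the same JMS assumptions as above: the TTL value, queue names, JNDI lookups and class are placeholders, and the broker (e.g. the Artemis instance embedded in Wildfly) would need to be configured to route expired messages to an expiry/dead-letter queue, where a single consumer raises the escalation for all overdue instances:

```java
import javax.annotation.Resource;
import javax.jms.*;

// Hypothetical helper invoked when a process instance enters its SOA wait state.
public class WaitStateEscalationTracker {

    @Resource(lookup = "java:/ConnectionFactory")        // placeholder JNDI names
    private ConnectionFactory connectionFactory;
    @Resource(lookup = "java:/queue/soaWaitStates")
    private Queue waitStateQueue;

    // Called when the request goes out: drop a "waiting" marker with a time-to-live.
    public void markWaiting(String businessKey) {
        try (JMSContext ctx = connectionFactory.createContext()) {
            TextMessage marker = ctx.createTextMessage(businessKey);
            marker.setJMSCorrelationID(businessKey);
            ctx.createProducer()
               .setTimeToLive(5 * 60 * 1000L)            // escalation threshold, placeholder
               .send(waitStateQueue, marker);
        } catch (JMSException e) {
            throw new RuntimeException(e);
        }
    }

    // Called by the reply listener: remove the marker so it never expires.
    // Markers that are NOT removed expire to the broker's expiry/dead-letter
    // queue - no per-instance BPMN timer required.
    public void markReplied(String businessKey) {
        try (JMSContext ctx = connectionFactory.createContext()) {
            ctx.createConsumer(waitStateQueue,
                    "JMSCorrelationID = '" + businessKey + "'").receiveNoWait();
        }
    }
}
```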
Looking back at these recommended next steps highlights our underlying problem: there's too much service-oriented function inside the current process implementation (e.g. binding BPM task implementations into the role of request/reply services). Essentially, we're looking to push the "waiting for a SOA response" back into the SOA layer (application services), because the act of waiting inside the engine doesn't bode well for capacity/throughput requirements.
Large task implementation payloads - hitting the 4k limitation:
- note: there's interesting analysis on this at SO. For today at least, I'll simply class this as "big".
The BPM engine's token data-store isn't an ideal place to manage records of this size. This highlights the difference (or ongoing argument) between BPM state management and SOA state management. Though the BPM engine plays a significant role, as process manager, in the overall service stack... that doesn't mean it's well suited to managing big JSON objects.
Marshaling "heavy" process variables tends to cause the following:
1) The BPM engine eats up memory in the application container (Wildfly) as these JSON objects are (a) read into process variables and (b) marshaled to and from the persistent store as the process moves between execution (awake) and wait states (asleep).
2) Wildfly begins "thrashing" as session management becomes a serious bottleneck - i.e. too much computing resource is dedicated to holding and marshaling in-memory object references.
3) Increasing Wildfly's thread and memory allocations may provide a temporary fix - but then more CPU is needed to manage the ever-growing set of object references.
Recommended next steps:
1) Review the JSON objects and ask "what's really necessary" for BPM-engine state management.
2) Offload large JSON object persistence to dedicated systems (see the sketch below). For truly BIG capacity... maybe even look to grid computing. We have options given the interest in "big data" (old problem... new name). Interesting topic for a later discussion.
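As a minimal sketch of the variable-offloading idea: the DocumentStore facade here is hypothetical, standing in for whatever dedicated store (document DB, data grid, etc.) ends up being chosen; the engine then only carries a lightweight key rather than the full payload:

```java
import java.util.UUID;

import org.camunda.bpm.engine.delegate.DelegateExecution;
import org.camunda.bpm.engine.delegate.JavaDelegate;

public class StorePayloadDelegate implements JavaDelegate {

    // Hypothetical facade over whatever dedicated store is chosen.
    public interface DocumentStore {
        void put(String key, String json);
        String get(String key);
    }

    private final DocumentStore documentStore;

    public StorePayloadDelegate(DocumentStore documentStore) {
        this.documentStore = documentStore;
    }

    @Override
    public void execute(DelegateExecution execution) {
        String bigJson = (String) execution.getVariable("requestPayload");

        // Instead of letting the engine marshal/unmarshal the 4k+ document at
        // every wait state, park it externally and carry only the key.
        String key = UUID.randomUUID().toString();
        documentStore.put(key, bigJson);

        execution.removeVariable("requestPayload");   // drop the heavy variable
        execution.setVariable("payloadRef", key);      // lightweight reference
    }
}
```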
Telling metrics here regarding performance and capacity planning. Excellent point:
Referring back to the above point regarding the management of "heavy" JSON objects.
Final note (before I'm distracted with other work): adding DB indexes may only provide a temporary, short-lived fix. Apologies if this sounds obvious, but an index actually increases the time required for writes to that table (volatility). So you'll see a short-lived performance increase, later followed by growing latency that reflects the additional effort now required for index management. I'd enjoy sharing some stories on this, ONLY because the recommendation to add indexes typically comes from our DBMS experts prior to any code or transaction/function review. NOT saying it's the wrong approach... just saying.