Description
Describe the bug
If we have a start execution listener on the process level (i.e. defined on the process element, not the BPMN start event), it is possible that the process instance does not show up in the Operate UI and is not returned by the corresponding API endpoints (e.g. process instance query). The reason is a race condition during import between the process instance record and the job record for the execution listener.
Note: This problem does not occur in the new exporter we are planning to deliver with 8.7/8.8, because the race condition does not exist there.
- Backport to 8.6
- Also verify, test, and if necessary fix in new exporter on main branch
- Note: On main we also need to fix it in the 8.6 importer code
To Reproduce
- Deploy a process model that has a `start` execution listener on the process level
- Start a process instance
- Let the job for the start listener be imported before the process instance
  - This happens by chance due to an importer race condition
  - It can be forced via remote debugging and orchestrating the importer threads with breakpoints in https://github.com/camunda/camunda/blob/stable/8.6/operate/importer-8_6/src/main/java/io/camunda/operate/zeebeimport/processors/ImportBulkProcessor.java#L122-L159 (i.e. let job records be imported first)
Current behavior
- The process instance does not become visible in the Operate UI
Expected behavior
- The process instance is displayed correctly
- It has no activity badges yet, because at that stage no activity is running yet
Root cause
- The importer imports different Zeebe value types out of order, so it can happen that jobs are handled before process instances
- In https://github.com/camunda/camunda/blob/8.7.0-alpha1/operate/importer-8_7/src/main/java/io/camunda/operate/zeebeimport/processors/ListViewZeebeRecordProcessor.java#L639-L688 we update a flow node instance with job details whenever a job is created. This has two problems:
  - `FlowNodeInstanceForListViewEntity` sets the list view's join relation property to `{"name": "activity", "parent": <value>}`. That means the document is inserted into the list view index as a flow node instance, not a process instance. When the process instance record is handled after that, it upserts this document. The upsert does not overwrite the `joinRelation` value again, so the process instance is wrongly declared an element instance
  - The code triggers for process instances, which it is not intended for
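The ordering problem described above can be illustrated with a minimal, self-contained sketch. It is not the Operate importer code: the in-memory map stands in for the list view index, the `upsert` helper, the `state` and `jobFailedWithRetriesLeft` fields, and the `"processInstance"` relation name are assumptions made for illustration, and the sketch assumes that the job of a process-level listener targets the same document id as the process instance.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative simulation only; names and fields are hypothetical, not Operate's API.
final class ListViewRaceSketch {

  /** Insert the document if it is absent; otherwise update only the given fields. */
  static void upsert(
      Map<String, Map<String, Object>> index,
      String id,
      Map<String, Object> insertDoc,
      Map<String, Object> updateFields) {
    Map<String, Object> existing = index.get(id);
    if (existing == null) {
      index.put(id, new HashMap<>(insertDoc));
    } else {
      existing.putAll(updateFields);
    }
  }

  /** Job handling: writes the document as an "activity" if it does not exist yet. */
  static void importJob(Map<String, Map<String, Object>> index, long elementInstanceKey) {
    Map<String, Object> insertDoc = new HashMap<>();
    insertDoc.put("joinRelation", "activity");
    insertDoc.put("jobFailedWithRetriesLeft", false);
    Map<String, Object> updateFields = new HashMap<>();
    updateFields.put("jobFailedWithRetriesLeft", false);
    upsert(index, String.valueOf(elementInstanceKey), insertDoc, updateFields);
  }

  /** Process instance handling: the update part does not touch joinRelation. */
  static void importProcessInstance(
      Map<String, Map<String, Object>> index, long processInstanceKey) {
    Map<String, Object> insertDoc = new HashMap<>();
    insertDoc.put("joinRelation", "processInstance");
    insertDoc.put("state", "ACTIVE");
    Map<String, Object> updateFields = new HashMap<>();
    updateFields.put("state", "ACTIVE");
    upsert(index, String.valueOf(processInstanceKey), insertDoc, updateFields);
  }

  /** Runs both imports in the requested order and returns the resulting join relation. */
  static String joinRelationAfterImport(boolean jobFirst) {
    Map<String, Map<String, Object>> index = new HashMap<>();
    long key = 1L; // assumption: process-level listener job points at the process instance key
    if (jobFirst) {
      importJob(index, key);
      importProcessInstance(index, key);
    } else {
      importProcessInstance(index, key);
      importJob(index, key);
    }
    return (String) index.get(String.valueOf(key)).get("joinRelation");
  }

  public static void main(String[] args) {
    System.out.println("process instance first: " + joinRelationAfterImport(false));
    System.out.println("job first: " + joinRelationAfterImport(true));
  }
}
```

In this sketch, the document keeps whichever join relation was written by the first record to arrive, which mirrors why a job imported first leaves the process instance classified as an activity.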
Workaround
Declare the listener on the start event of the process
Proposed Solution:
- Exclude `ListViewZeebeRecordProcessor#updateFlowNodeInstanceFromJob` from handling process-instance-level records, so that it only updates true flow node instances. For this, we need to decide if the data that this method adds to the list view entity is important/relevant for process instances or not. If not, we can apply this change
- Although this problem doesn't apply to the new exporter, to be consistent we should apply the same change there (i.e. not add job details to process instance documents in the list view index)
This will introduce a new (but minor) bug: the "failed but retries left" filter will not work for process instances with failed execution listener (EL) jobs on the process level, because the field does not get set correctly. In 8.7 there is an additional cause: the search request of the internal API includes a filter for `flowNodeInstances` (aka activities) only.
A new issue is created for this: #27318.
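A rough sketch of what such an exclusion could look like is shown below. It is not the change from the linked pull requests. It assumes that a job created for a process-level listener carries an element instance key equal to its process instance key, and `JobInfo`, `isProcessLevelJob`, and `jobsForFlowNodeUpdate` are hypothetical names introduced only for this example.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative guard only; types and method names are hypothetical.
final class JobRecordGuardSketch {

  /** Minimal stand-in for the job record fields the guard needs. */
  static final class JobInfo {
    final long processInstanceKey;
    final long elementInstanceKey;

    JobInfo(long processInstanceKey, long elementInstanceKey) {
      this.processInstanceKey = processInstanceKey;
      this.elementInstanceKey = elementInstanceKey;
    }
  }

  /** Assumption: a process-level job targets the process instance itself. */
  static boolean isProcessLevelJob(JobInfo job) {
    return job.elementInstanceKey == job.processInstanceKey;
  }

  /** Keeps only jobs that belong to true flow node instances. */
  static List<JobInfo> jobsForFlowNodeUpdate(List<JobInfo> jobs) {
    return jobs.stream()
        .filter(job -> !isProcessLevelJob(job))
        .collect(Collectors.toList());
  }
}
```

With a guard of this kind, job records for process-level listeners would never create or update a list view document, so the process instance document would only ever be written by the process instance record.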
Other proposed solution:
Fix the job-based update to also handle process instances. This is the more complex fix and may have more unforeseen side effects (as process instances could then start appearing before they are properly imported). Given our timeline and the rather minor impact of the bug introduced by the easier fix, we decided not to proceed with this option.
Links
- https://jira.camunda.com/browse/SUPPORT-24288
- https://jira.camunda.com/browse/SUPPORT-24703
- https://app.slack.com/client/T0PM0P1SA/C07UED2BZC2
- https://jira.camunda.com/browse/SUPPORT-25043
- https://jira.camunda.com/browse/SUPPORT-25176
- https://jira.camunda.com/browse/SUPPORT-25670
- https://jira.camunda.com/browse/SUPPORT-26073
### Pull Requests
- [x] main: https://github.com/camunda/camunda/pull/27294
- [x] 8.6: https://github.com/camunda/camunda/pull/27297
- [x] 8.7: https://github.com/camunda/camunda/pull/27299