Reapply WorkflowExecutionOptionsUpdated history event#7221
Conversation
bergundy
left a comment
There was a problem hiding this comment.
LGTM, I would appreciate at least a manual test confirming that a second start request would be deduped post replication.
924f1d9 to
d01c49a
Compare
3b35c30 to
0f01fd2
Compare
yycptt
left a comment
There was a problem hiding this comment.
Code LGTM, just want to confirm desired behavior.
There was a problem hiding this comment.
Two questions:
- So we think it makes sense to reapply this event even across runs?
- Will the callback logic get triggered on a run that get terminated due to version conflict?
There was a problem hiding this comment.
So we think it makes sense to reapply this event even across runs?
What do you mean by "across runs"?
Will the callback logic get triggered on a run that get terminated due to version conflict?
Currently, callbacks are called when the workflow is completed, except if it's continued as new or reset.
There was a problem hiding this comment.
So we think it makes sense to reapply this event even across runs?
I'm assuming you mean a workflow reset? If so then yes, it has similar semantics to a signal, but I don't see a reason to allow opting out of it ATM.
There was a problem hiding this comment.
Will the callback logic get triggered on a run that get terminated due to version conflict?
We didn't explicitly test that, so I can't say for sure. I'm not 100% what this scenario is, would another run take over as current and the losing run made a zombie? If that's the case, can we copy over the callbacks to the conflict winner? I'm not sure it completely makes sense to do that, it's kind of undefined behavior, probably best to trigger the callback and report termination of the loser and not apply to the winner.
There was a problem hiding this comment.
you mean a workflow reset?
No I mean reapply from zombie run to current run. I think it makes sense, so just confirm here.
would another run take over as current and the losing run made a zombie?
So all zombie runs will eventually be terminated (when the cluster creates the zombie run discovers there's a newer new run, it just terminate the run it has instead of putting it into zombie.)
The situation I am thinking is one workflow W starts a workflow W1 in cluster A, after a (forced) failover, it creates re-runs the operation and creates W2. W1 will eventually be terminated and W2 may still be running. Once terminated, callback in W1 will (or maybe not) run. So now W will see a failed nexus operation despite that W2 is still running.
can we copy over the callbacks to the conflict winner?
I think it's already the case since we are reapplying WorkflowExecutionOptionsUpdatedEvent from zombie to current run? If the zombie (eventually terminated) run will also try to report back then it's a race condition. and may not be the behavior we want.
There was a problem hiding this comment.
How/Where in the code the zombie workflow is terminated? Is it here?
How do we know it's a zombie there?
There was a problem hiding this comment.
Updated the PR to not process the callbacks when it's a zombie workflow.
The situation I am thinking is one workflow W starts a workflow W1 in cluster A, after a (forced) failover, it creates re-runs the operation and creates W2. W1 will eventually be terminated and W2 may still be running. Once terminated, callback in W1 will (or maybe not) run. So now W will see a failed nexus operation despite that W2 is still running.
Discussed offline: it should be a rare case, and we'll handle this case another time.
There was a problem hiding this comment.
Not processing callback for zombie workflows already fixed ^ issue since W1 will no longer report back to W. What we discussed offline is a different case where there are two levels of zombie workflows.
yycptt
left a comment
There was a problem hiding this comment.
I don't want to block this PR and at least the reapply logic within one run.
The zombie workflow cases I described are very rare and will only happen during forced failover, so we can address them later.
Feel free to revert to the previous version of the PR and land it.
service/history/ndc/workflow.go
Outdated
There was a problem hiding this comment.
Yeah. I think it is safe. This also may helped on a case where the terminated events comes from suppress should not be reapplied.
There was a problem hiding this comment.
Not processing callback for zombie workflows already fixed ^ issue since W1 will no longer report back to W. What we discussed offline is a different case where there are two levels of zombie workflows.
There was a problem hiding this comment.
Is this good enough?
I mean this will prevent the current cluster from generating callback tasks. But the termination event will eventually be replicated (in event-based replication) to another cluster, then in that cluster callback tasks will still be generated (and processed if failover happens)?
bef80ed to
e0a33b3
Compare
|
@yycptt I reverted the last commit as it's actually causing tests to fail, and there's more things to consider. I'll follow up on that in another PR. |
What changed?
Reapply WorkflowExecutionOptionsUpdated history event in conflict resolution
Why?
How did you test it?
Potential risks
Documentation
Is hotfix candidate?