
Reapply WorkflowExecutionOptionsUpdated history event#7221

Merged
rodrigozhou merged 2 commits into main from rodrigozhou/reapply-options-updated
Feb 6, 2025

Conversation

@rodrigozhou
Contributor

What changed?

Reapply WorkflowExecutionOptionsUpdated history event in conflict resolution

Why?

How did you test it?

Potential risks

Documentation

Is hotfix candidate?
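The idea behind the change, as I understand it, is that during conflict resolution only externally-originated events get reapplied to the winning run. A minimal sketch of that filtering rule, with illustrative event names (these are not the actual Temporal server enums or functions):

```go
package main

import "fmt"

// EventType is an illustrative stand-in for the server's history event enum.
type EventType int

const (
	EventWorkflowExecutionSignaled EventType = iota
	EventWorkflowExecutionOptionsUpdated
	EventWorkflowTaskStarted
)

// shouldReapply sketches the rule this PR extends: during conflict
// resolution, events carrying externally-originated state (signals,
// options updates) are reapplied to the current run, while internal
// bookkeeping events are dropped.
func shouldReapply(t EventType) bool {
	switch t {
	case EventWorkflowExecutionSignaled, EventWorkflowExecutionOptionsUpdated:
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(shouldReapply(EventWorkflowExecutionOptionsUpdated))
	fmt.Println(shouldReapply(EventWorkflowTaskStarted))
}
```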

Member

@bergundy bergundy left a comment


LGTM, I would appreciate at least a manual test confirming that a second start request would be deduped post replication.

@rodrigozhou rodrigozhou force-pushed the rodrigozhou/on-conflict-options branch 2 times, most recently from 924f1d9 to d01c49a on February 3, 2025 at 20:39
Base automatically changed from rodrigozhou/on-conflict-options to main February 3, 2025 21:53
@rodrigozhou rodrigozhou force-pushed the rodrigozhou/reapply-options-updated branch 2 times, most recently from 3b35c30 to 0f01fd2 on February 4, 2025 at 03:07
Member

@yycptt yycptt left a comment


Code LGTM, just want to confirm desired behavior.

Member


Two questions:

  • So we think it makes sense to reapply this event even across runs?
  • Will the callback logic get triggered on a run that gets terminated due to a version conflict?

Contributor Author

@rodrigozhou rodrigozhou Feb 4, 2025


So we think it makes sense to reapply this event even across runs?

What do you mean by "across runs"?

Will the callback logic get triggered on a run that gets terminated due to a version conflict?

Currently, callbacks are called when the workflow is completed, except if it's continued as new or reset.
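The rule stated above can be sketched as a small predicate. This is only an illustration of the described behavior; the type and function names are hypothetical, not the actual Temporal server API:

```go
package main

import "fmt"

// CloseReason is an illustrative stand-in for the ways a run can close.
type CloseReason int

const (
	ReasonCompleted CloseReason = iota
	ReasonFailed
	ReasonTerminated
	ReasonContinuedAsNew
	ReasonReset
)

// shouldTriggerCallbacks sketches the rule described above: completion
// callbacks fire when a workflow run closes, except when the run was
// continued as new or reset, since the logical execution is still alive.
func shouldTriggerCallbacks(r CloseReason) bool {
	switch r {
	case ReasonContinuedAsNew, ReasonReset:
		return false
	default:
		return true
	}
}

func main() {
	fmt.Println(shouldTriggerCallbacks(ReasonCompleted))
	fmt.Println(shouldTriggerCallbacks(ReasonContinuedAsNew))
}
```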

Member


So we think it makes sense to reapply this event even across runs?

I'm assuming you mean a workflow reset? If so then yes, it has similar semantics to a signal, but I don't see a reason to allow opting out of it ATM.

Member


Will the callback logic get triggered on a run that gets terminated due to a version conflict?

We didn't explicitly test that, so I can't say for sure. I'm not 100% sure what this scenario is: would another run take over as current and the losing run be made a zombie? If that's the case, can we copy over the callbacks to the conflict winner? I'm not sure it completely makes sense to do that; it's kind of undefined behavior. It's probably best to trigger the callback and report termination of the loser, and not apply it to the winner.

Member


you mean a workflow reset?

No, I mean reapply from a zombie run to the current run. I think it makes sense; just confirming here.

would another run take over as current and the losing run made a zombie?

So all zombie runs will eventually be terminated (when the cluster that created the zombie run discovers there's a newer run, it just terminates the run it has instead of putting it into zombie state).
The situation I am thinking of is: a workflow W starts a workflow W1 in cluster A; after a (forced) failover, it re-runs the operation and creates W2. W1 will eventually be terminated while W2 may still be running. Once terminated, the callback in W1 will (or maybe won't) run. So now W will see a failed Nexus operation even though W2 is still running.

can we copy over the callbacks to the conflict winner?

I think that's already the case, since we are reapplying WorkflowExecutionOptionsUpdatedEvent from the zombie to the current run? If the zombie (eventually terminated) run also tries to report back, then it's a race condition, and may not be the behavior we want.

Contributor Author


How/where in the code is the zombie workflow terminated? Is it here?
How do we know it's a zombie there?

Contributor Author


Updated the PR to not process the callbacks when it's a zombie workflow.

The situation I am thinking of is: a workflow W starts a workflow W1 in cluster A; after a (forced) failover, it re-runs the operation and creates W2. W1 will eventually be terminated while W2 may still be running. Once terminated, the callback in W1 will (or maybe won't) run. So now W will see a failed Nexus operation even though W2 is still running.

Discussed offline: it should be a rare case, and we'll handle this case another time.
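The fix described in this comment (skipping callback processing for zombie runs) can be sketched as a simple guard. Names here are illustrative only, not the actual Temporal server types:

```go
package main

import "fmt"

// WorkflowState is an illustrative stand-in for a run's mutable state.
type WorkflowState int

const (
	StateRunning WorkflowState = iota
	StateCompleted
	// StateZombie marks a run that lost conflict resolution; another
	// run for the same workflow is now current.
	StateZombie
)

// shouldProcessCallbacks sketches the update described above: a run
// that became a zombie during conflict resolution skips callback
// processing entirely, so the losing run never reports back to the
// caller workflow.
func shouldProcessCallbacks(state WorkflowState) bool {
	return state != StateZombie
}

func main() {
	fmt.Println(shouldProcessCallbacks(StateCompleted))
	fmt.Println(shouldProcessCallbacks(StateZombie))
}
```

With this guard, only the conflict winner can ever complete the Nexus operation back to the caller, which avoids the race between the terminated zombie and the still-running current run.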

Member


Not processing callbacks for zombie workflows already fixed the issue above, since W1 will no longer report back to W. What we discussed offline is a different case where there are two levels of zombie workflows.

Member

@yycptt yycptt left a comment


I don't want to block this PR, or at least the reapply logic within a single run.

The zombie workflow cases I described are very rare and will only happen during forced failover, so we can address them later.

Feel free to revert to the previous version of the PR and land it.

Member


Should be safe from what I can see, but cc @xwduan @hai719 @yux0 FYI.

Contributor


Yeah, I think it is safe. This may also help with the case where termination events that come from suppression should not be reapplied.


Member


Is this good enough?
I mean, this will prevent the current cluster from generating callback tasks. But the termination event will eventually be replicated (in event-based replication) to another cluster, and then callback tasks will still be generated in that cluster (and processed if a failover happens)?

@rodrigozhou rodrigozhou force-pushed the rodrigozhou/reapply-options-updated branch from bef80ed to e0a33b3 on February 5, 2025 at 21:59
@rodrigozhou
Contributor Author

@yycptt I reverted the last commit as it's actually causing tests to fail, and there are more things to consider. I'll follow up on that in another PR.

@rodrigozhou rodrigozhou merged commit 06dadf8 into main Feb 6, 2025
@rodrigozhou rodrigozhou deleted the rodrigozhou/reapply-options-updated branch February 6, 2025 00:22