
Reapply WorkflowExecutionOptionsUpdated history event#7221

Merged
rodrigozhou merged 2 commits into main from rodrigozhou/reapply-options-updated
Feb 6, 2025

Conversation

@rodrigozhou
Contributor

What changed?

Reapply WorkflowExecutionOptionsUpdated history event in conflict resolution

Why?

How did you test it?

Potential risks

Documentation

Is hotfix candidate?
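The idea behind the change, as I understand it, is that during conflict resolution only externally-originated events get reapplied to the winning run. A minimal sketch of that filtering rule, with illustrative event names (these are not the actual Temporal server enums or functions):

```go
package main

import "fmt"

// EventType is an illustrative stand-in for the server's history event enum.
type EventType int

const (
	EventWorkflowExecutionSignaled EventType = iota
	EventWorkflowExecutionOptionsUpdated
	EventWorkflowTaskStarted
)

// shouldReapply sketches the rule this PR extends: during conflict
// resolution, events carrying externally-originated state (signals,
// options updates) are reapplied to the current run, while internal
// bookkeeping events are dropped.
func shouldReapply(t EventType) bool {
	switch t {
	case EventWorkflowExecutionSignaled, EventWorkflowExecutionOptionsUpdated:
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(shouldReapply(EventWorkflowExecutionOptionsUpdated))
	fmt.Println(shouldReapply(EventWorkflowTaskStarted))
}
```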

Member

@bergundy bergundy left a comment


LGTM, I would appreciate at least a manual test confirming that a second start request would be deduped post replication.

@rodrigozhou rodrigozhou force-pushed the rodrigozhou/on-conflict-options branch 2 times, most recently from 924f1d9 to d01c49a on February 3, 2025 at 20:39
Base automatically changed from rodrigozhou/on-conflict-options to main February 3, 2025 21:53
@rodrigozhou rodrigozhou force-pushed the rodrigozhou/reapply-options-updated branch 2 times, most recently from 3b35c30 to 0f01fd2 on February 4, 2025 at 03:07
Member

@yycptt yycptt left a comment


Code LGTM, just want to confirm desired behavior.

Member


Two questions:

  • So we think it makes sense to reapply this event even across runs?
  • Will the callback logic get triggered on a run that gets terminated due to a version conflict?

Contributor Author

@rodrigozhou rodrigozhou Feb 4, 2025


So we think it makes sense to reapply this event even across runs?

What do you mean by "across runs"?

Will the callback logic get triggered on a run that gets terminated due to a version conflict?

Currently, callbacks are called when the workflow is completed, except if it's continued as new or reset.
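The rule stated above can be sketched as a small predicate. This is only an illustration of the described behavior; the type and function names are hypothetical, not the actual Temporal server API:

```go
package main

import "fmt"

// CloseReason is an illustrative stand-in for the ways a run can close.
type CloseReason int

const (
	ReasonCompleted CloseReason = iota
	ReasonFailed
	ReasonTerminated
	ReasonContinuedAsNew
	ReasonReset
)

// shouldTriggerCallbacks sketches the rule described above: completion
// callbacks fire when a workflow run closes, except when the run was
// continued as new or reset, since the logical execution is still alive.
func shouldTriggerCallbacks(r CloseReason) bool {
	switch r {
	case ReasonContinuedAsNew, ReasonReset:
		return false
	default:
		return true
	}
}

func main() {
	fmt.Println(shouldTriggerCallbacks(ReasonCompleted))
	fmt.Println(shouldTriggerCallbacks(ReasonContinuedAsNew))
}
```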

Member


So we think it makes sense to reapply this event even across runs?

I'm assuming you mean a workflow reset? If so then yes, it has similar semantics to a signal, but I don't see a reason to allow opting out of it ATM.

Member


Will the callback logic get triggered on a run that gets terminated due to a version conflict?

We didn't explicitly test that, so I can't say for sure. I'm not 100% sure what this scenario is: would another run take over as current and the losing run be made a zombie? If that's the case, can we copy over the callbacks to the conflict winner? I'm not sure it completely makes sense to do that; it's kind of undefined behavior. It's probably best to trigger the callback and report termination of the loser, and not apply it to the winner.

Member


you mean a workflow reset?

No, I mean reapply from a zombie run to the current run. I think it makes sense; just confirming here.

would another run take over as current and the losing run made a zombie?

So all zombie runs will eventually be terminated (when the cluster that created the zombie run discovers there's a newer run, it just terminates the run it has instead of putting it into zombie state).
The situation I am thinking of is: a workflow W starts a workflow W1 in cluster A; after a (forced) failover, it re-runs the operation and creates W2. W1 will eventually be terminated while W2 may still be running. Once terminated, the callback in W1 will (or maybe won't) run. So now W will see a failed Nexus operation even though W2 is still running.

can we copy over the callbacks to the conflict winner?

I think that's already the case, since we are reapplying WorkflowExecutionOptionsUpdatedEvent from the zombie to the current run? If the zombie (eventually terminated) run also tries to report back, then it's a race condition, and may not be the behavior we want.

Contributor Author


How/where in the code is the zombie workflow terminated? Is it here?
How do we know it's a zombie there?

Contributor Author


Updated the PR to not process the callbacks when it's a zombie workflow.

The situation I am thinking of is: a workflow W starts a workflow W1 in cluster A; after a (forced) failover, it re-runs the operation and creates W2. W1 will eventually be terminated while W2 may still be running. Once terminated, the callback in W1 will (or maybe won't) run. So now W will see a failed Nexus operation even though W2 is still running.

Discussed offline: it should be a rare case, and we'll handle this case another time.
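The fix described in this comment (skipping callback processing for zombie runs) can be sketched as a simple guard. Names here are illustrative only, not the actual Temporal server types:

```go
package main

import "fmt"

// WorkflowState is an illustrative stand-in for a run's mutable state.
type WorkflowState int

const (
	StateRunning WorkflowState = iota
	StateCompleted
	// StateZombie marks a run that lost conflict resolution; another
	// run for the same workflow is now current.
	StateZombie
)

// shouldProcessCallbacks sketches the update described above: a run
// that became a zombie during conflict resolution skips callback
// processing entirely, so the losing run never reports back to the
// caller workflow.
func shouldProcessCallbacks(state WorkflowState) bool {
	return state != StateZombie
}

func main() {
	fmt.Println(shouldProcessCallbacks(StateCompleted))
	fmt.Println(shouldProcessCallbacks(StateZombie))
}
```

With this guard, only the conflict winner can ever complete the Nexus operation back to the caller, which avoids the race between the terminated zombie and the still-running current run.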

Member


Not processing callbacks for zombie workflows already fixed the issue above, since W1 will no longer report back to W. What we discussed offline is a different case where there are two levels of zombie workflows.

Member

@yycptt yycptt left a comment


I don't want to block this PR, or at least the reapply logic within a single run.

The zombie workflow cases I described are very rare and will only happen during forced failover, so we can address them later.

Feel free to revert to the previous version of the PR and land it.

Member


Should be safe from what I can see, but cc @xwduan @hai719 @yux0 FYI.

Contributor


Yeah, I think it is safe. This may also help with the case where termination events that come from suppression should not be reapplied.


Member


Is this good enough?
I mean, this will prevent the current cluster from generating callback tasks. But the termination event will eventually be replicated (in event-based replication) to another cluster, and then callback tasks will still be generated in that cluster (and processed if a failover happens)?

@rodrigozhou rodrigozhou force-pushed the rodrigozhou/reapply-options-updated branch from bef80ed to e0a33b3 on February 5, 2025 at 21:59
@rodrigozhou
Contributor Author

@yycptt I reverted the last commit as it's actually causing tests to fail, and there are more things to consider. I'll follow up on that in another PR.

@rodrigozhou rodrigozhou merged commit 06dadf8 into main Feb 6, 2025
@rodrigozhou rodrigozhou deleted the rodrigozhou/reapply-options-updated branch February 6, 2025 00:22