Trampolining Part 2: Avoid infinite loops for Pinned workflows#9374
Trampolining Part 2: Avoid infinite loops for Pinned workflows#9374
Conversation
|
In theory, this PR could have been quite small if I had not worried/thought about the replication side of things. I think it's important to note that if we don't persist this information in history (on the event side of things), then it becomes difficult for us to rebuild the mutable state when looking at just the events. This could also cause a workflow, on a MS that has been rebuilt, to undergo an additional CAN. one could say that this is an overkill for something which is anyways the user's fault, but my knowledge on replication and how consistent things have to be kept is quite minimal, so I decided to take a stab at this. |
Co-Authored-By: Claude Opus 4.6 <[email protected]>
# Conflicts: # api/historyservice/v1/request_response.pb.go
Co-Authored-By: Claude Opus 4.6 <[email protected]>
43150c7 to
b426108
Compare
Co-Authored-By: Claude Opus 4.6 <[email protected]>
| // subsequent workflow tasks. | ||
| temporal.api.deployment.v1.InheritedAutoUpgradeInfo inherited_auto_upgrade_info = 16; | ||
| // The previous run's latest target Worker Deployment Version. Passed through | ||
| // during continue-as-new to be written on the WorkflowExecutionStartedEvent. |
There was a problem hiding this comment.
Nit: maybe also say "nil if this workflow is not initiated by continue-as-new"
| // True when this run was initiated by continue-as-new or retry and inherited | ||
| // a pinned version from the previous run. Used in the | ||
| // targetDeploymentVersionChanged condition to detect continuation runs that | ||
| // declined upgrade. | ||
| bool has_inherited_pinned_version_continue_as_new_or_retry = 114; |
There was a problem hiding this comment.
why are you including retry?
proto/internal/temporal/server/api/historyservice/v1/request_response.proto
Outdated
Show resolved
Hide resolved
# Conflicts: # go.mod # go.sum
distinguish "not set" from "set to unversioned (nil)". This fixes the case where a pinned workflow transitioning to an unversioned target was not receiving targetDeploymentVersionChanged=true. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
| // After the first workflow task, the effective behavior of the workflow is determined by worker-sent values in | ||
| // subsequent workflow tasks. | ||
| temporal.api.deployment.v1.InheritedAutoUpgradeInfo inherited_auto_upgrade_info = 16; | ||
| // During the previous run of this workflow, if it exists, the server may have |
There was a problem hiding this comment.
| // During the previous run of this workflow, if it exists, the server may have | |
| // During the previous run of this workflow, the server may have |
| // The last target version for which the server set targetDeploymentVersionChanged | ||
| // to true on a workflow task started event. Updated on each workflow task start, | ||
| // set only when the server decides to set the targetDeploymentVersionChanged flag | ||
| // to true. Also read during continue-as-new/retry to pass to the new run. |
There was a problem hiding this comment.
we also want to be explicit about that if the last wt started event had targetDeploymentVersionChanged=false then this value would be set to nil.
| default: | ||
| // Otherwise — target changed + did not decline to upgrade on CaN/retry. Signal the SDK. | ||
| targetDeploymentVersionChanged = true | ||
| m.ms.executionInfo.LastNotifiedTargetVersion = &historypb.LastNotifiedTargetVersion{ |
There was a problem hiding this comment.
we should always unset this if targetDeploymentVersionChanged becomes false.
| m.ms.executionInfo.LastNotifiedTargetVersion = &historypb.LastNotifiedTargetVersion{ | ||
| DeploymentVersion: targetDeploymentVersion, | ||
| } | ||
| m.ms.executionInfo.NotificationSuppressedTargetVersion = nil |
There was a problem hiding this comment.
run 1:
- pinned to v1
- then target version changes to v2
- run 1 CaNs without upgrade -> declined value would be set to v2 for run 2
run 2: - pinned to v1
- declined value is v2
- target version changes to v1 -> no notif -> last notif value is nil
- here, because target version is now different from the declined value we can reset it. we should also check the revision number.
- target version changes back to v2
- Q1: should we notif run 2 now about v2?
- yes, to give the wf another chance to go to v2
- say at this point, wf CaNs but again without upgrade
- Q2: should it pass v2 as declined to run 3?
- yes, because it declined upgrade duh!
There was a problem hiding this comment.
ty for this; have also added the rollback test.
| // After the first workflow task, the effective behavior of the workflow is determined by worker-sent values in | ||
| // subsequent workflow tasks. | ||
| temporal.api.deployment.v1.InheritedAutoUpgradeInfo inherited_auto_upgrade_info = 16; | ||
| // The target version that the previous run's SDK declined to upgrade to. |
There was a problem hiding this comment.
that the previous run implicitly declined to upgrade to.
…ned wfs (#721) _**READ BEFORE MERGING:** All PRs require approval by both Server AND SDK teams before merging! This is why the number of required approvals is "2" and not "1"--two reviewers from the same team is NOT sufficient. If your PR is not approved by someone in BOTH teams, it may be summarily reverted._ <!-- Describe what has changed in this PR --> -WISOTT <!-- Tell your future self why have you made these changes --> - Users that could using trampolining in the near future could run into a problem, which shall only surface if the workflow author does not have AutoUpgrade as the initial CAN behaviour for their Pinned workflows. As of today, if a user does hit this case, they would have the pinned workflow CAN'ing infinite number of times. This is because a pinned workflow would commence on the pinned version (since it would have inherited pinned version set to true) and if the pinned version is different than a worker-deployment's current version, the targetDeploymentVersionChanged would be set to true in the server. This would then result in the workflow just CAN'ing all the time! - This change aims to change that by only allowing at-most one CAN for such pinned workflow executions, and it does so by remembering what the initial version was at the start of the continue-as-new. <!-- Are there any breaking changes on binary or code level? --> **Breaking changes** <!-- If this breaks the Server, please provide the Server PR to merge right after this PR was merged. --> - [PR!](temporalio/temporal#9374) --------- Co-authored-by: Claude Opus 4.6 <[email protected]>
…ned wfs (#721) _**READ BEFORE MERGING:** All PRs require approval by both Server AND SDK teams before merging! This is why the number of required approvals is "2" and not "1"--two reviewers from the same team is NOT sufficient. If your PR is not approved by someone in BOTH teams, it may be summarily reverted._ <!-- Describe what has changed in this PR --> -WISOTT <!-- Tell your future self why have you made these changes --> - Users that could using trampolining in the near future could run into a problem, which shall only surface if the workflow author does not have AutoUpgrade as the initial CAN behaviour for their Pinned workflows. As of today, if a user does hit this case, they would have the pinned workflow CAN'ing infinite number of times. This is because a pinned workflow would commence on the pinned version (since it would have inherited pinned version set to true) and if the pinned version is different than a worker-deployment's current version, the targetDeploymentVersionChanged would be set to true in the server. This would then result in the workflow just CAN'ing all the time! - This change aims to change that by only allowing at-most one CAN for such pinned workflow executions, and it does so by remembering what the initial version was at the start of the continue-as-new. <!-- Are there any breaking changes on binary or code level? --> **Breaking changes** <!-- If this breaks the Server, please provide the Server PR to merge right after this PR was merged. --> - [PR!](temporalio/temporal#9374) --------- Co-authored-by: Claude Opus 4.6 <[email protected]>
|
Semgrep found 2 No explicit |
…ralio#9374) ## What changed? - Consists of the change to prevent any Pinned workflows, that may have forgotten to have the initial CAN Behaviour as AU, from CAN'ing infinitely. - Also allows trampolining of a Pinned workflow onto the Unversioned if un-versioned is the current version of the worker deployment at that point in time. Note, the effective behaviour of the workflow would later then be unversioned. ## Why? - Worker-Versioning correctness. ## How did you test it? - [x] built - [x] run locally and tested manually - [x] covered by existing tests - [ ] added new unit test(s) - [x] added new functional test(s) ## Potential risks - Pre-existing workflows (started before this fix) have nil TargetVersionOnStart. On the first WFT after deployment, "" != "build-v2" → spurious targetDeploymentVersionChanged=true. This is a one-time false-positive regression for those Pinned workflows. <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Medium Risk** > Touches generated protobuf API surface used by internal services; while the change is additive, mismatched proto versions across services could cause integration/compatibility issues. > > **Overview** > Adds `declined_target_version_upgrade` to `StartWorkflowExecutionRequest` (HistoryService API) so continue-as-new/retry chains can carry forward the SDK-declined target deployment version and avoid pinned-workflow trampolining loops. > > Regenerates protobuf Go bindings, updating import/type references across `request_response.pb.go`, and adds missing `Marshal`/`Unmarshal`/`Size`/`Equal` helpers for the persistence `LastNotifiedTargetVersion` message. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 0af06d0. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Co-authored-by: Claude Opus 4.6 <[email protected]>
What changed?
Why?
How did you test it?
Potential risks
"build-v2" → spurious targetDeploymentVersionChanged=true. This is a one-time false-positive regression for those Pinned workflows.
Note
Medium Risk
Touches generated protobuf APIs used across internal services; while changes are mostly additive/regeneration, wire/type mismatches or partial rollouts could cause compatibility issues.
Overview
Adds a new
declined_target_version_upgradefield toStartWorkflowExecutionRequest(historyservice API) so continue-as-new/retry runs can persist the prior run’s implicitly-declined target deployment version.Regenerates protobuf Go bindings, updating import aliases/types across
request_response.pb.go, and adds missingMarshal/Unmarshal/Size/Equalhelpers for persistence messageLastNotifiedTargetVersion.Written by Cursor Bugbot for commit 7e039eb. This will update automatically on new commits. Configure here.