Worker Deployment Version Drainage Status #7158
Conversation
if args.GetCurrentSinceTime().AsTime().Equal(d.GetVersionState().GetCurrentSinceTime().AsTime()) &&
	args.GetRampingSinceTime().AsTime().Equal(d.GetVersionState().GetRampingSinceTime().AsTime()) {
	res := &deploymentspb.SyncVersionStateResponse{VersionState: d.VersionState}
	return temporal.NewApplicationError("no change", errNoChangeType, res)
Thinking out loud: I believe the idea is to use the function handleSyncState when we want to notify changes related to both ramping and current, right?
I am having a hard time convincing myself that this validator is doing the right thing now that we are trying to add ramping. In my head, the two operations are, to a certain extent, mutually exclusive. If a version x is already current and the user calls SetCurrent on it again, shouldn't we only evaluate whether GetCurrentSinceTime equals the local state? Why should we check the ramping time?
I know we are going to set the value of ramping_since_time when SetCurrent is called too, but how are we going to pass in the same value stored inside d.GetVersionState().GetRampingSinceTime() from SetCurrent?
I think when SetCurrent is called, ramping_since_time will be set to nil, and we want to sync both values. Similarly, if SetRamp is called, current_since_time will be set to nil, and we want to sync that across the workflows/task queues that need to know about it.
I changed this to make more sense
I like this change more - no need to evaluate the time since current was set; just check whether it was current/ramping.
// (-- api-linter: core::0140::prepositions=disabled
//     aip.dev/not-precedent: 'Since' captures the field semantics despite being a preposition. --)
// Nil if not ramping.
google.protobuf.Timestamp ramping_since_time = 3;
Thinking out loud: if we are planning to use version_workflow.handleSyncState to also set a version to ramping, this should have a ramp percentage set, right?
Yes, I think so! I was planning to wait and add that in the ramp PR, but we could preemptively add the field now. I'll at least add a TODO.
WorkerDeploymentWorkflowType              = "temporal-sys-worker-deployment-workflow"
WorkerDeploymentVersionWorkflowType       = "temporal-sys-worker-deployment-version-workflow"
WorkerDeploymentCheckDrainageWorkflowType = "temporal-sys-worker-deployment-check-drainage-workflow"
temporal-sys-worker-deployment-check-drainage-workflow -> temporal-sys-worker-deployment-drainage-workflow
no need to keep it super verbose; wdyt?
if wasAcceptingNewWorkflows && !isAcceptingNewWorkflows {
	state.DrainageInfo = &deploymentpb.VersionDrainageInfo{}
	workflow.ExecuteChildWorkflow(ctx, WorkerDeploymentCheckDrainageWorkflowType, d.a.namespace)
}
Say we have a version workflow that has become draining and kickstarts this workflow. Now, say this version workflow continues-as-new (CaN) - I might have missed it, but I don't see a PARENT_CLOSE_POLICY for this child workflow.
IMHO, we should:
- set the policy to terminate, so that the child workflow terminates when we CaN
- in the version_workflow.run function, if the drainage state is set to draining, simply kickstart this child workflow again so it continues execution

This will also call for the TODO inside drainage_workflow.go to be implemented, since if this version workflow were to start accepting executions again, we would have to send a signal to this child workflow to stop executing.
wdyt?
Yes, I need to add code handling CaN of the parent, and also handle terminating the child workflow if the parent version starts accepting traffic again.
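For reference, setting the parent close policy on the child is a one-line change in the child workflow options. This is a sketch of the configuration discussed above, not the PR's code - the helper name and the use of the namespace as the only argument are assumptions:

```go
package deployment

import (
	enumspb "go.temporal.io/api/enums/v1"
	"go.temporal.io/sdk/workflow"
)

// startDrainageChild starts the drainage child workflow with
// PARENT_CLOSE_POLICY_TERMINATE, so the child is terminated whenever the
// parent version workflow closes (including continue-as-new). The parent
// then restarts the child after CaN if the version is still draining.
func startDrainageChild(ctx workflow.Context, namespace string) workflow.ChildWorkflowFuture {
	childCtx := workflow.WithChildOptions(ctx, workflow.ChildWorkflowOptions{
		ParentClosePolicy: enumspb.PARENT_CLOSE_POLICY_TERMINATE,
	})
	return workflow.ExecuteChildWorkflow(childCtx, WorkerDeploymentCheckDrainageWorkflowType, namespace)
}
```

With terminate-on-close plus restart-after-CaN in version_workflow.run, the drainage loop survives CaN without leaking an orphaned child.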
DeploymentName:   d.VersionState.DeploymentName,
Version:          d.VersionState.Version,
CurrentSinceTime: d.VersionState.CurrentSinceTime,
RampingSinceTime: d.VersionState.RampingSinceTime,
CreateTime:       d.VersionState.CreateTime,
DrainageInfo:     d.VersionState.DrainageInfo,
Is there a point in altering the workflow memo of the version workflow to store this? IMO, the 3.1 APIs don't require querying visibility to get information from version workflows directly. We only require that version workflows return their local state during DescribeVersion calls.
I think we should instead alter the worker-deployment workflow memo to have these additional fields you've added, since the deployment workflow will now be the source of truth for things like "current version", "ramping version", etc. (RoutingInfo, basically).
wdyt?
After our discussion on what the memo is used for (providing info via DescribeWorkflowExecution and ListWorkflowExecutions), I removed everything here, because DescribeVersion uses a query, not the memo. I have a hunch that we might find it useful to put drainage info in here at some point, but we should only add that when we have a use case for it.
google.protobuf.Timestamp routing_update_time = 4;
// If this version is the current version of its deployment.
bool is_current = 5;
// Range: [0, 100]. Must be zero if is_current is true. Must be non-zero if `version.version`
May have to change this comment, since we don't have a version.version field (or we may in the near future, based on Shahab's latest API refinements).
go s.pollFromDeployment(ctx, tv2)
s.EventuallyWithT(func(t *assert.CollectT) {
	a := assert.New(t)
	resp, err := s.FrontendClient().DescribeWorkerDeploymentVersion(ctx, &workflowservice.DescribeWorkerDeploymentVersionRequest{
		Namespace: s.Namespace().String(),
		Version:   tv2.BuildID(),
	})
	a.NoError(err)
	a.Equal(tv2.DeploymentSeries(), resp.GetWorkerDeploymentVersionInfo().GetDeploymentName())
	a.Equal(tv2.BuildID(), resp.GetWorkerDeploymentVersionInfo().GetVersion())
}, time.Second*5, time.Millisecond*200)
This may not be what you intended - I assume you wanted to start a new version workflow under the same deployment name, right? However, both tv1 and tv2 use tv.BuildID(), which generates the same BuildID, so only one version workflow gets registered.
IMHO, maybe try using .WithBuildIDNumber(number) for both of these separate tvs?
I think since I initialized them with this, the build IDs are now different whenever I call tv.BuildID(). Thanks for catching this!
tv1 := testvars.New(s).WithBuildIDNumber(1)
tv2 := testvars.New(s).WithBuildIDNumber(2)
	)
}

func (s *DeploymentVersionSuite) TestDrainageStatus_SetCurrentVersion_NoOpenWFs() {
This tests the new parts of the feature (version stops accepting new executions -> child workflow starts -> child sets the status to DRAINING and sleeps for visibilityGracePeriod -> child queries visibility, counts 0 open workflows, and sets the status to DRAINED).
It does not exercise the visibility query, since there are no open workflows, but the visibility-query code has not changed from the pre-release, so that should all still work. There are TODOs in the comments below to test it.
Shivs11 left a comment:
Haven't yet peeked at the functional tests, but the code looks nice - thanks for the annoying refactorings required by the latest API refinements too. I'll take a peek at the tests later today, but merging to iterate faster.
message SyncVersionStateUpdateArgs {
	// Last time `current_since_time`, `ramping_since_time`, or `ramp_percentage` of this version changed.
	google.protobuf.Timestamp routing_update_time = 4;

message SetCurrent {
	// If last_became_current_time is present, then set the deployment's
	// last_became_current_time to it and set is_current true. If it's missing,
	// set is_current false.
	google.protobuf.Timestamp last_became_current_time = 1;
}

	// (-- api-linter: core::0140::prepositions=disabled
	//     aip.dev/not-precedent: 'Since' captures the field semantics despite being a preposition. --)
	// Nil if not current.
	google.protobuf.Timestamp current_since_time = 5;

	// (-- api-linter: core::0140::prepositions=disabled
	//     aip.dev/not-precedent: 'Since' captures the field semantics despite being a preposition. --)
	// Nil if not ramping. Updated when the version first starts ramping, not on each ramp change.
	google.protobuf.Timestamp ramping_since_time = 6;

	// Range: [0, 100]. Must be zero if the version is not ramping (i.e. `ramping_since_time` is nil).
	float ramp_percentage = 7;
Nit: these new fields are nice, but they can be numbered from 1, since we haven't released this feature to customers yet and the changes will be backwards compatible.
if info.Status == enumspb.VERSION_DRAINAGE_STATUS_DRAINED {
	return nil
}
workflow.Sleep(ctx, refreshInterval)
This whole new file is really nice and the workflow makes sense. I wonder if, in the future (certainly not right now), we'd want to implement a backoff strategy for when the workflow sleeps.
Just some food for thought.
},
WorkflowVersioningMode: 0, // todo
CreateTime:             now,
RoutingUpdateTime:      now,
Hmm, I think I might have spotted a lag in our implementation. Assume the following flow:
RegisterTaskQueueWorker -> updateWithStartWorkerDeploymentVersion -> handleRegisterWorker
In the highlighted code, we set routingUpdateTime to now, which is right. This makes the version workflow's local state have routingUpdateTime = now.
Then, when the update takes place in handleRegisterWorker, we set the routingUpdateTime for the task queue being registered by calling time.Now() again. This may cause a lag between the local state's update time and the time sent to the task queue.
In handleRegisterWorker:236:
// initial data
data := &deploymentspb.DeploymentVersionData{
	Version:           d.VersionState.Version,
	RoutingUpdateTime: timestamppb.Now(),
	CurrentSinceTime:  nil, // not current
	RampingSinceTime:  nil, // not ramping
	RampPercentage:    0,   // not ramping
	FirstPollerTime:   args.FirstPollerTime,
}
Again, not a big deal given that routing will still work - but just some food for thought on whether we should take the extra step and eliminate the lag.
Hm, yeah, I think it would be better to pass the original routing time along.
Merging this to speed up development - more tests to follow.
## What changed?
Add DrainageStatus child workflow to worker deployment system.

## Why?
To periodically update the version workflow with results from visibility.

## How did you test it?
Functional tests. Currently broken (see todo comment in code).

## Potential risks

## Documentation

## Is hotfix candidate?

Co-authored-by: Shivam Saraf <[email protected]>