Skip to content

Add DrainageStatus#525

Merged
carlydf merged 9 commits intoversioning-3.1from
cdf/drainage
Jan 28, 2025
Merged

Add DrainageStatus#525
carlydf merged 9 commits intoversioning-3.1from
cdf/drainage

Conversation

@carlydf
Copy link
Copy Markdown
Contributor

@carlydf carlydf commented Jan 22, 2025

READ BEFORE MERGING: All PRs require approval by both Server AND SDK teams before merging! This is why the number of required approvals is "2" and not "1"--two reviewers from the same team is NOT sufficient. If your PR is not approved by someone in BOTH teams, it may be summarily reverted.

Add DrainageStatus

Why?

Breaking changes

Server PR

@carlydf carlydf requested review from a team as code owners January 22, 2025 07:09
Comment on lines +159 to +161
// This means if the Version is "drained" but new workflows are sent to it via
// Pinned Versioning Override, the status does not account for those Pinned-override
// executions and remains "drained".
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a bug? If I override and start sending workflows I would expect to be draining until these workflows finish, so that I don't remove the workers...

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only a temporary limitation. We'll improve it later, but don't want to spend time on it while working on P0 items because this edge case is not expected to be frequent.

(Note that the limitation is only when the version is already drained and THEN use starts new wfs on it with pinned override. If it's not yet drained and user starts wfs with pinned override, the status will count those wfs.)

Copy link
Copy Markdown
Contributor

@antlai-temporal antlai-temporal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before we had a separate API because of the cost of doing reachability. Are we assuming that this is no longer a problem because we added caching with minutes of refresh delay? Is rate limiting no longer needed?


// Specify the drainage state for a Worker Deployment Version so users can decide whether they
// can safely decommission the version.
enum VersionDrainageState {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know the linter we use suggests naming enums as xyzState and not xyzStatus, but it seems to me "status" would fit here better because it conveys that this is a read-only value that is only reported to user, as opposed to a state that can be mutable by user directly.

@ShahabT
Copy link
Copy Markdown
Contributor

ShahabT commented Jan 22, 2025

Before we had a separate API because of the cost of doing reachability. Are we assuming that this is no longer a problem because we added caching with minutes of refresh delay? Is rate limiting no longer needed?

@antlai-temporal, correct. We wont rate limit DescribeWorkerDeploymentVersion aggressively anymore, we'd just return the latest drainage info. It's server's business how often it'll refresh them (within reasonable intervals), but the last refresh timestamp is given to user.

Comment on lines +143 to +146
// - This is the Current Version of the Deployment.
// - This is the Ramping Version of the Deployment.
// - This is Unversioned and the Deployment does not have a Current
// Deployment.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the latest UI work, it seems useful to be able to differentiate between these from the Version POV. So I'm considering to have the following enum instead of the bool:

enum VersionRoutingStatus {
  UNSPECIFIED,
  // No routing status, hence new executions are not routed to this Version (unless they are explicitly pinned to this version).
  NONE,
  // This is the Current Version of the Deployment, hence it receives new executions.
  CURRENT,
  // This is the Ramping Version of the Deployment, hence it receives a share of new executions.
  RAMPING,
}

(maybe we'll add more routing statuses particularly for unversioned mode later when we add support for them, but right now we want to keep those parts out of the API because they are not implemented)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the framework of State = writeable / Status = readable consequence of stuff, would this be better named as VersionRoutingState?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm... its sort of in the grey area because "current_version" is an state from deployment POV but a version being current or not is a status from Version POV.

But now thinking more, maybe we don't need this enum. Because we'd have to have timestamps such as current_since and ramping_since for each version and if they have value it means the version is current or ramping.

}

// Describes the routing state that the user has put this version in.
enum VersionRoutingState {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there may be some controversy around this based on previous design discussions. So, let's not add it just yet. Instead the two timestamps of current_since and ramping_since should be enough for now. Once unversioned workers are supported we'd have a better idea about these.

Comment on lines +138 to +143
google.protobuf.Timestamp current_since_time = 4;

// (-- api-linter: core::0140::prepositions=disabled
// aip.dev/not-precedent: 'Since' captures the field semantics despite being a preposition. --)
// Nil if not ramping.
google.protobuf.Timestamp ramping_since_time = 5;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for the late review but how does the thought of encapsulating this under a message sound? Both these time stamps are dealing with routing information - looking below, the timestamps are under a single message

wdyt @carlydf ?

Comment on lines +135 to +142
// Last time `is_current` or `ramp_percentage` of this version changed.
google.protobuf.Timestamp routing_update_time = 4;

// If this version is the current version of its deployment.
bool is_current = 5;

// Range: [0, 100]. Must be zero if is_current is true. Must be non-zero if version is ramping.
float ramp_percentage = 6;
Copy link
Copy Markdown
Contributor

@ShahabT ShahabT Jan 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the following reasons, I kinda prefer the previous format which had two separate timestamps better (but yeah, we should add ramp_percentage too):

  • Based on @antlai-temporal 's feedback, we're going to support ramping version with 0% ramp. This format does not allow distinguishing that from "not ramping version".
  • In the UI, we plan to tell user "ramping since x" where x is the time from start of the ramp, not the last ramp percentage change. (that info seems less valuable from user's POV)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so I guess what this would also mean is that we need server to have these separate timestamps present (maybe instead of routingUpdateTime) to keep things less-confusing

cc - @carlydf

@carlydf carlydf merged commit 7430e24 into versioning-3.1 Jan 28, 2025
@carlydf carlydf deleted the cdf/drainage branch January 28, 2025 20:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants