Adding metrics for Maxunavailable feature in StatefulSet#130951

Merged
k8s-ci-robot merged 14 commits into kubernetes:master from Edwinhr716:maxunavailable_metrics
Sep 17, 2025

Conversation

@Edwinhr716
Contributor

@Edwinhr716 Edwinhr716 commented Mar 20, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

Adds a metric that tracks how many times a maxUnavailable violation has occurred. This is a requirement for the beta graduation of kubernetes/enhancements#961.
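
As a rough illustration of what counting violations could look like, here is a minimal, self-contained sketch. It is not the code from this PR: `violationCounter` and `observe` are hypothetical names, and the rule that a violation means more unavailable replicas than maxUnavailable allows is an assumption made for the example.

```go
package main

import "fmt"

// violationCounter is a hypothetical in-memory counter keyed by "namespace/name".
// It stands in for a real labelled counter metric.
type violationCounter map[string]int

// observe increments the counter when more replicas are unavailable than
// maxUnavailable permits, and reports whether a violation was recorded.
func (c violationCounter) observe(namespace, name string, unavailable, maxUnavailable int) bool {
	if unavailable <= maxUnavailable {
		return false
	}
	c[namespace+"/"+name]++
	return true
}

func main() {
	c := violationCounter{}
	c.observe("default", "web", 1, 2) // within budget, not counted
	c.observe("default", "web", 3, 2) // exceeds budget, counted
	fmt.Println(c["default/web"])     // prints 1
}
```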

Which issue(s) this PR fixes:

Part of kubernetes/enhancements#961

Special notes for your reviewer:

This is a follow-up to the discussion on the KEP update PR kubernetes/enhancements#4474.

General consensus seems to be that this metric should be in tree instead of in kube-state-metrics.

Open question:

  • Should the metric be generic like the one exposed by deployment?

cc @atiratree @dgrisonnet @wojtek-t who were part of the original discussion.

Does this PR introduce a user-facing change?

Adds a metric for the StatefulSet MaxUnavailable feature.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 20, 2025
@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 20, 2025
@github-project-automation github-project-automation Bot moved this to Needs Triage in SIG Apps Mar 20, 2025
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Mar 20, 2025
@Edwinhr716
Contributor Author

/assign @janetkuo @soltysh

@k8s-ci-robot k8s-ci-robot added area/stable-metrics Issues or PRs involving stable metrics sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. labels Mar 20, 2025
@janetkuo
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 20, 2025
Comment thread pkg/controller/statefulset/stateful_set_control.go Outdated
@Edwinhr716
Contributor Author

/retest

@k8s-triage-robot

This PR may require stable metrics review.

Stable metrics are guaranteed to not change. Please review the documentation for the requirements and lifecycle of stable metrics and ensure that your metrics meet these guidelines.

@dims
Member

dims commented Mar 24, 2025

cc @xiaohongchen1991

Comment thread pkg/controller/statefulset/stateful_set_control.go Outdated
Comment thread pkg/controller/statefulset/metrics/metrics.go Outdated
Comment thread test/instrumentation/testdata/stable-metrics-list.yaml Outdated
@janetkuo
Member

LGTM in general after the presubmit check failure is fixed. @soltysh would you like to take a look as well?

return err
}
}
metrics.MaxUnavailable.WithLabelValues(set.Namespace, set.Name, podManagementPolicy).Set(float64(maxUnavailable))
Contributor

Given that this functionality is tied to the MaxUnavailableStatefulSet feature gate, I believe we should wrap this entire block in an `if utilfeature.DefaultFeatureGate.Enabled(features.MaxUnavailableStatefulSet) { ... }` block.
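
The gating suggested in this comment can be sketched in a self-contained way as follows. This is only an illustration under stated assumptions: `featureGate` and `gaugeVec` are hypothetical stand-ins for `utilfeature.DefaultFeatureGate` and the labelled `metrics.MaxUnavailable` gauge, and `recordMaxUnavailable` is an invented helper, not a function from the PR.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// maxUnavailableGate mirrors the gate name mentioned in the review comment.
const maxUnavailableGate = "MaxUnavailableStatefulSet"

// featureGate is a hypothetical stand-in for utilfeature.DefaultFeatureGate.
type featureGate map[string]bool

// Enabled reports whether the named gate is switched on.
func (g featureGate) Enabled(name string) bool { return g[name] }

// gaugeVec is a hypothetical stand-in for a labelled gauge metric.
type gaugeVec struct {
	mu     sync.Mutex
	values map[string]float64
}

func newGaugeVec() *gaugeVec { return &gaugeVec{values: map[string]float64{}} }

// Set stores value under the key formed by joining the label values.
func (g *gaugeVec) Set(value float64, labels ...string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.values[strings.Join(labels, "/")] = value
}

// Get returns the stored value for the label values, if any.
func (g *gaugeVec) Get(labels ...string) (float64, bool) {
	g.mu.Lock()
	defer g.mu.Unlock()
	v, ok := g.values[strings.Join(labels, "/")]
	return v, ok
}

// recordMaxUnavailable sets the gauge only when the feature gate is enabled,
// so clusters with the gate off never emit the series.
func recordMaxUnavailable(gate featureGate, g *gaugeVec, namespace, name, policy string, maxUnavailable int) {
	if !gate.Enabled(maxUnavailableGate) {
		return
	}
	g.Set(float64(maxUnavailable), namespace, name, policy)
}

func main() {
	gate := featureGate{maxUnavailableGate: true}
	g := newGaugeVec()
	recordMaxUnavailable(gate, g, "default", "web", "OrderedReady", 2)
	v, ok := g.Get("default", "web", "OrderedReady")
	fmt.Println(v, ok) // prints "2 true"
}
```

Keeping the gate check inside one helper means every call site gets the same behavior when the gate is off.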

@@ -1013,7 +1015,7 @@ func TestStatefulSetControlRollingUpdateWithMaxUnavailable(t *testing.T) {
// if pod 4 ready, start to update pod 3, even though 5 is not ready
spc.setPodRunning(set, 4)
spc.setPodRunning(set, 5)
Contributor

Since you're fixing it, drop this line. You're setting pod 5 running further below, so this line is confusing.

}
if len(pods) != totalPods {
t.Fatalf("Expected create pods 2/3, got pods %v", pods)
// In OrderedReady mode, only 5 pods exist at this point (pod 5 not created yet)
Contributor

Suggested change
- // In OrderedReady mode, only 5 pods exist at this point (pod 5 not created yet)
+ // In OrderedReady mode, only 4 pods exist at this point (pod 5 not created yet)

spc.setPodRunning(set, 5)
spc.setPodReady(set, 5)
originalPods, _ = spc.setPodReady(set, 3)
originalPods, _ = spc.setPodReadyCondition(set, 3, true)
Contributor

Something is off here: the comment claims pods 3, 4, and 5 are ready, but we're only setting pods 3 and 5 running, and only pod 3 ready. Why?

Member

The comment is correct. The test sets pod 4 to the running state on line 1058.

Contributor

I see what you're saying: we're setting the pod running in both simpleOrderedVerificationFn and simpleParallelVerificationFn, which is somewhat misleading. It seems like this particular test case deserves a rewrite, but that's pre-existing, so definitely not part of this PR.

@helayoty helayoty force-pushed the maxunavailable_metrics branch from c2f7485 to 7630021 Compare September 16, 2025 23:12
@helayoty helayoty force-pushed the maxunavailable_metrics branch from 7630021 to b9aa7c2 Compare September 16, 2025 23:50
@k8s-ci-robot
Contributor

@Edwinhr716: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-kubernetes-linter-hints | b9aa7c2 | link | false | /test pull-kubernetes-linter-hints |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Contributor

@soltysh soltysh left a comment

/lgtm
/approve


expectedUnavailableReplicasValue int
}

testFn := func(test *testcase, t *testing.T) {
Contributor

This pattern, where we define the check function beforehand even though it could easily be part of the invocation below, is confusing. But that's not blocking this particular PR.

Member

I don't prefer it either, but I wanted to follow the same pattern as the other tests in this file.
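
For readers unfamiliar with the alternative being discussed, the sketch below shows the check written inline in the loop over the cases rather than in a function declared ahead of time. It is a simplified illustration, not the test from this PR: `unavailableReplicas` and `runCases` are hypothetical helpers, and the plain loop stands in for what would normally be `t.Run` subtests so the example runs as an ordinary program.

```go
package main

import "fmt"

// unavailableReplicas is a hypothetical helper: desired replicas that are
// not currently available, never negative.
func unavailableReplicas(replicas, available int) int {
	if available >= replicas {
		return 0
	}
	return replicas - available
}

// testcase borrows the expectedUnavailableReplicasValue field name from the
// diff context; the other fields are assumptions for this sketch.
type testcase struct {
	name                             string
	replicas                         int
	available                        int
	expectedUnavailableReplicasValue int
}

// runCases performs the check inline while iterating over the cases and
// returns one message per failing case.
func runCases(cases []testcase) []string {
	var failures []string
	for _, tc := range cases {
		got := unavailableReplicas(tc.replicas, tc.available)
		if got != tc.expectedUnavailableReplicasValue {
			failures = append(failures, fmt.Sprintf("%s: got %d, want %d",
				tc.name, got, tc.expectedUnavailableReplicasValue))
		}
	}
	return failures
}

func main() {
	cases := []testcase{
		{name: "all available", replicas: 3, available: 3, expectedUnavailableReplicasValue: 0},
		{name: "one missing", replicas: 3, available: 2, expectedUnavailableReplicasValue: 1},
	}
	fmt.Println(len(runCases(cases))) // prints 0
}
```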

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 17, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 9df13ab1ec357cf3f7b86497d127faaf6f4787bd

@soltysh
Contributor

soltysh commented Sep 17, 2025

/label tide/merge-method-squash

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Edwinhr716, hashim21223445, janetkuo, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Sep 17, 2025
@k8s-ci-robot k8s-ci-robot merged commit fa90713 into kubernetes:master Sep 17, 2025
12 of 13 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.35 milestone Sep 17, 2025
@github-project-automation github-project-automation Bot moved this from In Progress to Done in SIG Apps Sep 17, 2025
@github-project-automation github-project-automation Bot moved this from In Progress to Done in SIG Instrumentation Sep 17, 2025

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/stable-metrics Issues or PRs involving stable metrics cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Projects

Archived in project
Archived in project
