KEP-4205: Graduate Expose PSI Metrics to GA #5605

Merged
k8s-ci-robot merged 6 commits into kubernetes:master from mariafromano-25:update-kep-4205-to-ga
Feb 9, 2026

@mariafromano-25
Contributor

@mariafromano-25 mariafromano-25 commented Oct 2, 2025

  • One-line PR description:
    This PR updates KEP-4205 to reflect the graduation of the "Expose PSI Metrics" feature to General Availability (GA), targeting the v1.35 release.
    The KEP stage has been updated to stable and the latest-milestone is now v1.35.
  • Other comments:
    /kind documentation
    /sig node
    /assign @roycaihw

@k8s-ci-robot k8s-ci-robot added kind/documentation Categorizes issue or PR as related to documentation. sig/node Categorizes an issue or PR as relevant to SIG Node. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 2, 2025
@k8s-ci-robot
Contributor

Welcome @mariafromano-25!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Oct 2, 2025
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 2, 2025
@k8s-ci-robot
Contributor

Hi @mariafromano-25. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Oct 2, 2025
@kannon92
Contributor

kannon92 commented Oct 3, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 3, 2025
@kannon92
Contributor

kannon92 commented Oct 3, 2025

Two things:

  1. Please add a prod-readiness file to request a PRR review for this KEP.

  2. We normally avoid promoting features to GA if there are test failures.

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Oct 7, 2025
Comment thread keps/sig-node/4205-psi-metric/README.md Outdated
NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
-->
Yes
Yes, but starting in v1.35 where this feature graduates to GA, the KubeletPSI feature gate will be locked to true and will no longer be disable-able.
Member

s/1.35/1.36

milestone:
  alpha: "v1.33"
  beta: "v1.34"
  stable: "v1.36"
Member

I wonder if we should leave the milestone unchanged. We can update it during the 1.36 cycle. When we update we should confirm that all GA criteria are met.

Either way we will need to update the latest-milestone: "v1.35" line when 1.36 comes.

Contributor Author

updated to 1.36 for both!
for latest-milestone, the comment above mentioned that

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on

and since I'm actively working on it now, I updated it to 1.36

#### GA
- Gather evidence of real world usage.
- No major issue reported.
- Quantify the cAdvisor and kubelet-level overhead of PSI metric collection, especially where PSI is disabled at the kernel level.
Member

Let's also ensure we cover stress-testing scenarios. We may need to expand on the previous performance benchmarking.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 5, 2026
@mariafromano-25
Contributor Author

cc: @ndixita

@pacoxu pacoxu mentioned this pull request Jan 27, 2026
34 tasks
Comment thread keps/sig-node/4205-psi-metric/README.md Outdated
- No major issue reported.
- Quantify the cAdvisor and kubelet-level overhead of PSI metric collection, especially where PSI is disabled at the kernel level.
- Validate with SIG Node that collection overhead is acceptable for general use cases, or include opt-out knobs.
- Expanded stress testing with diverse environments and scenarios, while maintaining acceptable minimal resource consumption as outlined in Beta perf testing.
Contributor

Just checking if we have a list of different environments and scenarios that we plan to add tests for, we could document those as well.

Comment thread keps/sig-node/4205-psi-metric/README.md Outdated
NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
-->
Yes
Yes, but starting in v1.36 where this feature graduates to GA, the KubeletPSI feature gate will be locked to true and will no longer be disable-able.
Contributor

nit: Yes, but starting in v1.36 where this feature graduates to GA, the KubeletPSI feature gate will be locked to true and can no longer be disabled.

@kannon92
Contributor

#4205 (comment)

Should we investigate this before GA?

@haircommander
Contributor

> #4205 (comment)
>
> Should we investigate this before GA?

This was talked through on the bug, and I believe not. This can be solved today with Prometheus-level metric filtering, and in the future CRI stats can choose with more granularity whether to expose these metrics at the pod / infra container level. Metrics are expensive, and that expense is generally considered worth it.
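
As a rough illustration of what metric-level filtering amounts to, here is a minimal Python sketch that drops PSI series from a Prometheus text-exposition payload. It assumes the PSI series share the name prefix `container_pressure_`, which is an assumption to verify against the names your cluster actually exposes. `drop_psi_metrics` is a hypothetical helper, not part of the kubelet, cAdvisor, or Prometheus. In a real deployment the filtering would more likely live in the scrape configuration than in custom code.

```python
import re

# Assumption: PSI series share this name prefix; check your own /metrics output.
PSI_METRIC_PREFIX = "container_pressure_"

# A sample line starts with the metric name, optionally followed by {labels}.
_METRIC_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def drop_psi_metrics(exposition_text, prefix=PSI_METRIC_PREFIX):
    """Return the exposition text without samples or HELP/TYPE lines for PSI series."""
    kept = []
    for line in exposition_text.splitlines():
        if line.startswith("#"):
            # Metadata lines look like "# HELP <name> <text>" or "# TYPE <name> <type>".
            parts = line.split(None, 3)
            is_meta = len(parts) >= 3 and parts[1] in ("HELP", "TYPE")
            name = parts[2] if is_meta else ""
        else:
            match = _METRIC_NAME.match(line)
            name = match.group(1) if match else ""
        if not name.startswith(prefix):
            kept.append(line)
    return "\n".join(kept)
```

Matching on the metric name alone keeps every other container metric untouched; anyone who wants PSI only for some pods would need label-aware filtering instead.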

@kannon92
Contributor

kannon92 commented Feb 9, 2026

> This was talked through on the bug, and I believe not. This can be solved today with Prometheus-level metric filtering, and in the future CRI stats can choose with more granularity whether to expose these metrics at the pod / infra container level. Metrics are expensive, and that expense is generally considered worth it.

My main concern is regressions for those who don't implement this. Will we document or support a knob to filter out these pods? Or do we just accept the performance hit and wait to solve it in a future KEP?

Contributor

@haircommander haircommander left a comment

On that line, I am feeling ready to approve this PR. I can imagine cases where a user would want to enable PSI but disable the metrics (because of the cost), but we don't currently allow users to disable any default container metrics in cAdvisor. As we move to CRI stats, that opens the avenue for more granularity and the ability to customize which metrics are exposed. Thus, the aforementioned cases can be handled by that, rather than by introducing new kubelet config fields. If customers want to enable PSI, they get these metrics. If they don't want the metrics, they shouldn't enable PSI. There's room to improve in the future.

I am curious whether we feel we have enough real-world data yet, but that will be teased out in the process of actually bumping to stable. Let's make sure users are using it and are happy.

/approve

from SIG node side.

Thanks for the update @mariafromano-25 !

- Quantify the cAdvisor and kubelet-level overhead of PSI metric collection, especially where PSI is disabled at the kernel level.
- Validate with SIG Node that collection overhead is acceptable for general use cases, or include opt-out knobs.
- Expanded stress testing with diverse environments and scenarios, while maintaining acceptable minimal resource consumption as outlined in Beta perf testing.
- Gather evidence of real-world usage from beta users.
Contributor

Do we have any data on this yet? Has Google enabled it? OpenShift allows customers to enable it but has not turned it on by default.

Contributor Author

Google enables it by default, and there is no way for the user to turn it off at the moment.

Grepping for CONFIG_PSI in /boot/CONFIG-FILE
CONFIG_PSI=y
# CONFIG_PSI_DEFAULT_DISABLED is not set

The custom node system configurations documentation does not mention it either. But the beta performance test report indicated negligible overhead on both the node and kubelet level. I am working on more performance tests to also include the kernel.
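
For anyone wanting to repeat the kernel-config check above across several node images, here is a minimal Python sketch that classifies a kernel config file's contents. It looks only at the two options quoted above, `CONFIG_PSI` and `CONFIG_PSI_DEFAULT_DISABLED`. The function name `psi_kernel_status` and its return strings are hypothetical, chosen for illustration, and it does not account for a `psi=` kernel command-line override.

```python
def psi_kernel_status(config_text):
    """Classify PSI support from the text of a kernel config file."""
    options = {}
    unset_suffix = " is not set"
    for raw in config_text.splitlines():
        line = raw.strip()
        if line.startswith("# CONFIG_") and line.endswith(unset_suffix):
            # "# CONFIG_FOO is not set" means the option is disabled.
            options[line[2:-len(unset_suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            key, _, value = line.partition("=")
            options[key] = value

    if options.get("CONFIG_PSI") != "y":
        return "not built in"
    if options.get("CONFIG_PSI_DEFAULT_DISABLED") == "y":
        return "built in, disabled by default"
    return "enabled by default"
```

Fed the two lines quoted above, this returns "enabled by default"; reading the file from disk is left to the caller because the config path differs between distributions.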

@johnbelamaric
Member

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 9, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: haircommander, johnbelamaric, mariafromano-25

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 9, 2026
@k8s-ci-robot k8s-ci-robot merged commit bf71aa0 into kubernetes:master Feb 9, 2026
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.36 milestone Feb 9, 2026
Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/documentation Categorizes issue or PR as related to documentation. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/node Categorizes an issue or PR as relevant to SIG Node. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
