Fix zero PSI metrics emitted when OS doesn't enable PSI #137326

Merged
k8s-ci-robot merged 1 commit into kubernetes:master from amritansh1502:fix-136333-kubelet-zero-psi-metrics on Mar 18, 2026

Conversation

@amritansh1502
Contributor

@amritansh1502 amritansh1502 commented Mar 1, 2026

What type of PR is this?

/kind bug

What this PR does / why we need it:

When the KubeletPSI feature gate is enabled but the OS does not support PSI (either the kernel lacks CONFIG_PSI or was booted with psi=0), the kubelet could register PressureMetrics with cAdvisor. This causes zero-valued PSI metrics to be emitted, which is confusing for monitoring and alerting.

The previous IsPsiEnabled() implementation checked /proc/pressure and parsed /proc/cmdline for psi=0. This was unreliable because /proc/pressure can exist even when the per-cgroup PSI files, which are what cAdvisor actually reads, are absent (for example, on cgroup v1 systems).
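For context, the command-line half of such a check could look roughly like the sketch below. It is not the code that was removed; `psiDisabledOnCmdline` is a hypothetical name, and the sketch only recognizes the literal `psi=0`.

```go
package main

import (
	"fmt"
	"strings"
)

// psiDisabledOnCmdline reports whether the last psi= parameter on a kernel
// command line is the literal "psi=0". Hypothetical helper for illustration;
// other spellings of "off" are deliberately ignored in this sketch.
func psiDisabledOnCmdline(cmdline string) bool {
	disabled := false
	for _, field := range strings.Fields(cmdline) {
		if strings.HasPrefix(field, "psi=") {
			// A later psi= parameter overrides an earlier one here.
			disabled = strings.TrimPrefix(field, "psi=") == "0"
		}
	}
	return disabled
}

func main() {
	fmt.Println(psiDisabledOnCmdline("root=/dev/sda1 psi=0 quiet")) // true
	fmt.Println(psiDisabledOnCmdline("root=/dev/sda1 quiet"))       // false
}
```

A `false` result only says that nobody passed psi=0. It says nothing about whether the per-cgroup pressure files exist, which is the gap described above.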

This PR simplifies IsPsiEnabled() to open /sys/fs/cgroup/cpu.pressure — the same type of per-cgroup file that cAdvisor reads PSI values from. This single check is sufficient because:

  • PSI is a single kernel feature (CONFIG_PSI / boot param psi=) so checking cpu.pressure determines support for all three resources (cpu, memory, io).
  • On cgroup v1 systems (where cAdvisor cannot read per-cgroup PSI anyway), this file does not exist, so we correctly skip PSI metrics.
  • The /proc/cmdline parsing is no longer needed since the cgroup file's absence already covers the psi=0 case.
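The single-file check described above can be sketched with the standard library alone. This is an approximation under stated assumptions: the PR itself goes through the opencontainers/cgroups library, whereas `psiFileReadable`, `demo`, and the `unifiedMountpoint` constant below are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// unifiedMountpoint is the conventional cgroup v2 mount path (assumed here;
// the PR takes this value from the cgroups library instead).
const unifiedMountpoint = "/sys/fs/cgroup"

// psiFileReadable reports whether psiFile can be opened under cgroupDir.
// Any open error, including a missing file, is treated as "PSI unavailable".
func psiFileReadable(cgroupDir, psiFile string) bool {
	f, err := os.Open(filepath.Join(cgroupDir, psiFile))
	if err != nil {
		return false
	}
	f.Close()
	return true
}

// demo exercises both cases against a throwaway directory that stands in
// for a cgroup root: first with the file absent, then with it present.
func demo() (whenAbsent, whenPresent bool, err error) {
	dir, err := os.MkdirTemp("", "fake-cgroup")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	whenAbsent = psiFileReadable(dir, "cpu.pressure")

	sample := []byte("some avg10=0.00 avg60=0.00 avg300=0.00 total=0\n")
	if err := os.WriteFile(filepath.Join(dir, "cpu.pressure"), sample, 0o644); err != nil {
		return false, false, err
	}
	whenPresent = psiFileReadable(dir, "cpu.pressure")
	return whenAbsent, whenPresent, nil
}

func main() {
	fmt.Println("host PSI available:", psiFileReadable(unifiedMountpoint, "cpu.pressure"))
	absent, present, err := demo()
	fmt.Println(absent, present, err)
}
```

Taking the directory and file name as parameters keeps the check testable without touching the real cgroup filesystem.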

Note:

PSI is a single kernel feature (CONFIG_PSI / boot param psi=) that exposes the pressure files atomically, so checking only /sys/fs/cgroup/cpu.pressure is sufficient to determine support for all three resource types.

The check is used in both cadvisor.New() and Server.InstallAuthNotRequiredHandlers() so that pressure metrics
are only collected when PSI is actually supported by the OS.
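As a rough illustration of that gating, the sketch below adds pressure metrics to the collected set only when both the feature gate and the host check pass. The type `metricKind`, its constants, and `buildIncludedMetrics` are hypothetical stand-ins, not cAdvisor or kubelet APIs.

```go
package main

import "fmt"

// metricKind is a hypothetical stand-in for cAdvisor's metric-kind type.
type metricKind string

const (
	cpuUsageMetrics metricKind = "cpu"
	pressureMetrics metricKind = "psi"
)

// buildIncludedMetrics returns the metric kinds to collect. Pressure metrics
// are added only when the feature gate is on AND the host check passes.
// The host check is not consulted at all when the gate is off.
func buildIncludedMetrics(gateEnabled bool, psiSupported func() bool) map[metricKind]struct{} {
	included := map[metricKind]struct{}{cpuUsageMetrics: {}}
	if gateEnabled && psiSupported() {
		included[pressureMetrics] = struct{}{}
	}
	return included
}

func main() {
	unsupported := func() bool { return false }
	_, ok := buildIncludedMetrics(true, unsupported)[pressureMetrics]
	fmt.Println("pressure metrics collected:", ok) // false
}
```

Passing the host check as a function makes it easy to reuse the same decision in more than one call site.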

Screenshots:

After making the code changes, I built the kubelet, copied it into a local kind cluster, and tested with both psi=0 and psi=1. Here are the results.
Before (zero PSI metrics emitted):
Screenshot From 2026-03-01 16-45-59

After (PSI metrics skipped when unsupported):
Screenshot From 2026-03-01 16-37-20

Which issue(s) this PR is related to:

Fixes #136333

Special notes for your reviewer:

  • Unit tests cover both cases (file present and file absent) and all pass.
  • Manually tested on a kind cluster with PSI enabled and PSI disabled (psi=0 via GRUB).

Does this PR introduce a user-facing change?

Fixed an issue where zero-valued PSI (Pressure Stall Information) metrics were emitted by the kubelet when the OS does not support PSI, even if the KubeletPSI feature gate was enabled.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

N/A

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 1, 2026
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Mar 1, 2026
@k8s-ci-robot
Contributor

Hi @amritansh1502. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 1, 2026
@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 1, 2026
@amritansh1502 amritansh1502 force-pushed the fix-136333-kubelet-zero-psi-metrics branch from e20d7fa to ddac88a Compare March 1, 2026 18:16
@amritansh1502 amritansh1502 marked this pull request as ready for review March 1, 2026 18:25
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 1, 2026
@kannon92
Contributor

kannon92 commented Mar 1, 2026

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 1, 2026
@amritansh1502 amritansh1502 force-pushed the fix-136333-kubelet-zero-psi-metrics branch from ddac88a to a45b0a4 Compare March 1, 2026 18:49
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 1, 2026
@amritansh1502 amritansh1502 force-pushed the fix-136333-kubelet-zero-psi-metrics branch from a45b0a4 to 9a05dc9 Compare March 1, 2026 18:55
Comment thread pkg/kubelet/cadvisor/cadvisor_linux.go Outdated
// the host. PSI is a single kernel feature (CONFIG_PSI / boot param "psi=")
// that exposes /proc/pressure/{cpu,memory,io} atomically — checking the
// /proc/pressure directory is sufficient to determine support for all three.
func IsPsiEnabled(ctx context.Context) bool {
Member

Is the kubelet the right place for this logic? Would it be better for this detection to read from the same place cAdvisor reads the actual values?

Contributor Author

Yes, reading from the same place cAdvisor reads the actual values is feasible. Currently I am checking /proc/pressure plus /proc/cmdline, but cAdvisor actually reads PSI from per-cgroup pressure files via the opencontainers/cgroups library, and its statPSI() already returns nil when those files don't exist or the kernel returns ENOTSUP. I will change the detection to stat the root cgroup's file instead.
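The "treat missing file or ENOTSUP as unsupported" behavior mentioned in this comment can be sketched as a small error classifier. This is not the library's code; `psiUnsupported` and the sample error variables are hypothetical, and the sketch uses `syscall.ENOTSUP` from the standard library.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// Sample errors used for illustration only.
var (
	errFileMissing  error = &os.PathError{Op: "open", Path: "cpu.pressure", Err: syscall.ENOENT}
	errNotSupported error = &os.PathError{Op: "read", Path: "cpu.pressure", Err: syscall.ENOTSUP}
	errOther              = errors.New("unexpected read failure")
)

// psiUnsupported reports whether err means "PSI is not available here"
// (file absent, or the kernel refused the read) rather than a genuine failure.
func psiUnsupported(err error) bool {
	return errors.Is(err, os.ErrNotExist) || errors.Is(err, syscall.ENOTSUP)
}

func main() {
	fmt.Println(psiUnsupported(errFileMissing))  // true
	fmt.Println(psiUnsupported(errNotSupported)) // true
	fmt.Println(psiUnsupported(errOther))        // false
}
```

Using `errors.Is` means the classifier still works when the underlying errno is wrapped in an `*os.PathError`.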

Member

Is there a reference to opencontainers/cgroups already? Can we call a library method instead of re-implementing it and trying to keep it in sync long term?

Contributor Author

Thanks. Yes, opencontainers/cgroups is already used in pkg/kubelet/cm/, so we can use cgroups.OpenFile and cgroupfs2.UnifiedMountpoint.

Comment thread pkg/kubelet/cadvisor/cadvisor_linux.go Outdated

if utilfeature.DefaultFeatureGate.Enabled(features.KubeletPSI) {
includedMetrics[cadvisormetrics.PressureMetrics] = struct{}{}
ctx := context.Background()
Member

Nit: it may be best to allow passing a logger into the New method and use it everywhere. We may decide to have a context in those operations in the future, but since they are mostly file system operations, I do not expect that to happen very soon.

Contributor Author

Will fix. I will update New() to accept a klog.Logger instead of creating a context.Background() internally.

@amritansh1502 amritansh1502 force-pushed the fix-136333-kubelet-zero-psi-metrics branch 2 times, most recently from 0db3e81 to 87767f1 Compare March 12, 2026 14:16
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 12, 2026
@amritansh1502 amritansh1502 force-pushed the fix-136333-kubelet-zero-psi-metrics branch from 87767f1 to f69fa3c Compare March 12, 2026 16:20
@amritansh1502 amritansh1502 force-pushed the fix-136333-kubelet-zero-psi-metrics branch from f69fa3c to fbb6896 Compare March 12, 2026 18:54
return isPsiEnabled(logger, cgroupfs2.UnifiedMountpoint, "cpu.pressure")
}

func isPsiEnabled(logger klog.Logger, cgroupDir, psiFile string) bool {
Member

This way of checking LGTM. Ideally, this would happen in cAdvisor, so that it does not return zeroes for something that doesn't exist.

Contributor Author

Thanks, agreed. Ideally cAdvisor itself would not emit zeros when PSI data is nil. The nil from statPSI() gets flattened to a zero-valued struct in cAdvisor's setPSIStats before reaching Prometheus. I will file an upstream issue on google/cadvisor to track that as a follow-up.
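The flattening problem described in this comment is easy to reproduce in isolation. The sketch below is not cAdvisor's code; `psiData`, `flatten`, and `emit` are hypothetical names that only show why a nil pointer and a genuinely idle sample become indistinguishable once converted to a value.

```go
package main

import "fmt"

// psiData is a hypothetical, trimmed-down PSI sample.
type psiData struct {
	Avg10 float64
	Total uint64
}

// flatten mimics a lossy conversion: a nil pointer becomes a zero-valued
// struct, so "no data" and "no pressure" look identical downstream.
func flatten(p *psiData) psiData {
	if p == nil {
		return psiData{}
	}
	return *p
}

// emit keeps the distinction by reporting ok=false for nil input,
// letting the caller skip the metric entirely.
func emit(p *psiData) (value psiData, ok bool) {
	if p == nil {
		return psiData{}, false
	}
	return *p, true
}

func main() {
	idle := &psiData{} // PSI supported, but no stalls recorded

	fmt.Println(flatten(nil) == flatten(idle)) // true: the two cases collapse

	_, okMissing := emit(nil)
	_, okIdle := emit(idle)
	fmt.Println(okMissing, okIdle) // false true
}
```

Carrying an explicit `ok` flag (or keeping the pointer) is one way a metrics exporter could avoid publishing zeros for data that was never read.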

Member

+1. Having cAdvisor handle that will be cleaner and more consistent. This PR LGTM as a mitigation for the issue. We can migrate when cAdvisor is ready.

Member

@SergeyKanzhelev SergeyKanzhelev left a comment

@ndixita any thoughts on this PR?

@amritansh1502
Contributor Author

amritansh1502 commented Mar 17, 2026

Hi @SergeyKanzhelev, thank you for the review feedback. I have addressed all your comments in the latest push (switched to opencontainers/cgroups, passing a logger instead of a context). Could you LGTM when you get a chance?
With Code Freeze coming up tomorrow, I would also appreciate an approver review.
@yujuhong would you be able to take a look?

cc @ndixita @dchen1107

@mariafromano-25
Contributor

Thanks for your work on this!
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 17, 2026
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 34a843ecde4d5ef5bd502b2e397240a063277cb0

@mariafromano-25
Contributor

/test pull-kubernetes-unit-windows-master

if IsPsiEnabled(logger) {
includedMetrics[cadvisormetrics.PressureMetrics] = struct{}{}
} else {
logger.Info("PSI support not available")
Member

Instead of Info, should this be Warning? It seems to be a misconfiguration to me if KubeletPSI is enabled but the kernel doesn't support it.

@SergeyKanzhelev
Member

/lgtm
/approve
/skip

let's merge this and follow up on cAdvisor fix

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: amritansh1502, SergeyKanzhelev

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 17, 2026
@k8s-ci-robot
Contributor

@amritansh1502: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-kubernetes-e2e-capz-windows | 756de57 | link | false | /test pull-kubernetes-e2e-capz-windows |
| pull-kubernetes-unit-windows-master | fbb6896 | link | false | /test pull-kubernetes-unit-windows-master |
| pull-kubernetes-e2e-gce | fbb6896 | link | unknown | /test pull-kubernetes-e2e-gce |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@SergeyKanzhelev
Member

/retest-required

@k8s-ci-robot k8s-ci-robot merged commit ad4e232 into kubernetes:master Mar 18, 2026
13 of 14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.36 milestone Mar 18, 2026

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

Zero value Kubelet PSI metrics emitted even if underlying OS doesn't enable it

6 participants