Fix zero PSI metrics emitted when OS doesn't enable PSI#137326
Conversation
|
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the The DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Hi @amritansh1502. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
e20d7fa to
ddac88a
Compare
|
/ok-to-test |
ddac88a to
a45b0a4
Compare
a45b0a4 to
9a05dc9
Compare
| // the host. PSI is a single kernel feature (CONFIG_PSI / boot param "psi=") | ||
| // that exposes /proc/pressure/{cpu,memory,io} atomically — checking the | ||
| // /proc/pressure directory is sufficient to determine support for all three. | ||
| func IsPsiEnabled(ctx context.Context) bool { |
There was a problem hiding this comment.
is kubelet the right place to have this logic in? Will it be better to have this detection be read from the same place cAdvisor reads actual values?
There was a problem hiding this comment.
Yeah reading from the same place cadvisor reads actual values is feasible , Currently i am checking /proc/pressure + /proc/cmdline , but cadvisor actually reads PSI from per-cgroup pressure files via the opencontainers/cgroups library, and its statPSI() already returns nil when those files don't exist or the kernel returns ENOTSUP. I will change the detection to stat the root cgroup's file instead .
There was a problem hiding this comment.
is there a reference to opencontainers/cgroups already? Can we call some library method instead of re-implementing it and trying to keep in-sync long term?
There was a problem hiding this comment.
thanks, yes opencontainers/cgroups is already used in pkg/kubelet/cm/, we can use cgroups.Openfile and cgroupfs2.UnifiedMountpoint
|
|
||
| if utilfeature.DefaultFeatureGate.Enabled(features.KubeletPSI) { | ||
| includedMetrics[cadvisormetrics.PressureMetrics] = struct{}{} | ||
| ctx := context.Background() |
There was a problem hiding this comment.
nit, it may be best to allow to pass logger into New method and use it everywhere. We may decide to have a context in future in those operations, but since it is mostly file system operations, I do not expect it happen very soon.
There was a problem hiding this comment.
will fix , i will update New() to accept a klog.Logger instead of creating a context.Background() internally.
0db3e81 to
87767f1
Compare
87767f1 to
f69fa3c
Compare
Made-with: Cursor
f69fa3c to
fbb6896
Compare
| return isPsiEnabled(logger, cgroupfs2.UnifiedMountpoint, "cpu.pressure") | ||
| } | ||
|
|
||
| func isPsiEnabled(logger klog.Logger, cgroupDir, psiFile string) bool { |
There was a problem hiding this comment.
this way of checking is lgtm. Ideal for me would be if this would have happened in cAdvisor and it will not return zeroes for something that doesn't exist.
There was a problem hiding this comment.
thanks, agreed, ideally cAdvisor itself would not emit zeros when PSI data is nil ,the statPSI() nil gets flattened to a zero-valued struct in cAdvisor setPSIStats before reaching the prometheus . I will file an upstream issue on google/cadvisor to track that as a follow-up.
There was a problem hiding this comment.
+1. Having cAdvisor handle that will be cleaner and more consistent. This PR LGTM as to mitigate the issue. We can migrate when cAdvisor is ready.
SergeyKanzhelev
left a comment
There was a problem hiding this comment.
@ndixita any thoughts on this PR?
|
hi @SergeyKanzhelev, thank you for the review feedback , I have addressed all your |
|
Thanks for your work on this! |
|
LGTM label has been added. DetailsGit tree hash: 34a843ecde4d5ef5bd502b2e397240a063277cb0 |
|
/test pull-kubernetes-unit-windows-master |
| return isPsiEnabled(logger, cgroupfs2.UnifiedMountpoint, "cpu.pressure") | ||
| } | ||
|
|
||
| func isPsiEnabled(logger klog.Logger, cgroupDir, psiFile string) bool { |
There was a problem hiding this comment.
+1. Having cAdvisor handle that will be cleaner and more consistent. This PR LGTM as to mitigate the issue. We can migrate when cAdvisor is ready.
| if IsPsiEnabled(logger) { | ||
| includedMetrics[cadvisormetrics.PressureMetrics] = struct{}{} | ||
| } else { | ||
| logger.Info("PSI support not available") |
There was a problem hiding this comment.
Instead of Info, should this be Warning? It seems to be a misconfiguration to me if KubeletPSI is enabled but kernel doesn't support it.
|
/lgtm let's merge this and follow up on cAdvisor fix |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: amritansh1502, SergeyKanzhelev The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@amritansh1502: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/retest-required |
What type of PR is this?
/kind bug
What this PR does / why we need it:
When the
KubeletPSIfeature gate is enabled but the OS does not support PSI (either the kernel lacksCONFIG_PSIor was booted withpsi=0), the kubelet could registerPressureMetricswith cAdvisor. This causes zero-valued PSI metrics to be emitted, which is confusing for monitoring and alerting.The previous
IsPsiEnabled()implementation checked/proc/pressureand parsed/proc/cmdlineforpsi=0. This was unreliable because/proc/pressurecan exist even when per-cgroup PSI files — which cAdvisor actually reads — are absent ( on cgroup v1 systems).This PR simplifies
IsPsiEnabled()to open/sys/fs/cgroup/cpu.pressure— the same type of per-cgroup file that cAdvisor reads PSI values from. This single check is sufficient because:CONFIG_PSI/ boot parampsi=) so checkingcpu.pressuredetermines support for all three resources (cpu, memory, io)./proc/cmdlineparsing is no longer needed since the cgroup file's absence already covers thepsi=0case.Note:
PSI is a single kernel feature (
CONFIG_PSI/ boot parampsi=) that exposes/sys/fs/cgroup/cpu.pressureatomically — checking onlys sys/fs/cgroup/cpu.pressure is sufficient to determine support for all three resource types.The check is used in both
cadvisor.New()andServer.InstallAuthNotRequiredHandlers()so that pressure metricsare only collected when PSI is actually supported by the OS.
Screenshots:
After making the code changes , i have build it and copy kubelet to the local kind cluster and tested for both psi =0 and psi =1, here are the results.

Before (zero PSI metrics emitted):
After (PSI metrics skipped when unsupported):

Which issue(s) this PR is related to:
Fixes #136333
Special notes for your reviewer:
psi=0via GRUB).Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: