KEP-4205: Graduate Expose PSI Metrics to GA #5605
k8s-ci-robot merged 6 commits into kubernetes:master
/ok-to-test
Two things:
NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
-->
Yes
Yes, but starting in v1.35 where this feature graduates to GA, the KubeletPSI feature gate will be locked to true and will no longer be disable-able.
milestone:
  alpha: "v1.33"
  beta: "v1.34"
  stable: "v1.36"
I wonder if we should leave the milestone unchanged. We can update it during the 1.36 cycle. When we update we should confirm that all GA criteria are met.
Either way we will need to update the latest-milestone: "v1.35" line when 1.36 comes.
Updated to 1.36 for both!
For latest-milestone, the comment above it says:
# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on
Since I'm actively working on it now, I updated it to 1.36.
#### GA
- Gather evidence of real world usage.
- No major issue reported.
- Quantify the cAdvisor and kubelet-level overhead of PSI metric collection, especially where PSI is disabled at the kernel level.
Let's also make sure we cover stress-testing scenarios. We may need to expand on the previous performance benchmarking.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
243b641 to 15dfa9e
cc: @ndixita
- No major issue reported.
- Quantify the cAdvisor and kubelet-level overhead of PSI metric collection, especially where PSI is disabled at the kernel level.
- Validate with SIG Node that collection overhead is acceptable for general use cases, or include opt-out knobs.
- Expanded stress testing with diverse environments and scenarios, while maintaining acceptable minimal resource consumption, as outlined in Beta perf testing.
Just checking whether we have a list of the environments and scenarios we plan to add tests for. If so, we could document those as well.
NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
-->
Yes
Yes, but starting in v1.36 where this feature graduates to GA, the KubeletPSI feature gate will be locked to true and will no longer be disable-able.
nit: Yes, but starting in v1.36 where this feature graduates to GA, the KubeletPSI feature gate will be locked to true and can no longer be disabled.
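To illustrate what "locked to true" means for operators, here is a minimal Python model of lock-to-default semantics. It is only a sketch: the real logic lives in Kubernetes' Go feature-gate code, where a feature spec carries a `LockToDefault` field, and the class and method names below (`FeatureSpec`, `FeatureGates`, `set`, `enabled`) are hypothetical stand-ins rather than any upstream API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureSpec:
    """Hypothetical model of a feature gate's spec (not the upstream Go type)."""
    default: bool
    lock_to_default: bool = False


class FeatureGates:
    """Toy registry that rejects overrides of locked gates."""

    def __init__(self, known):
        self._known = dict(known)
        self._overrides = {}

    def set(self, name, value):
        spec = self._known[name]
        # A locked gate accepts only its default value; anything else is an error.
        if spec.lock_to_default and value != spec.default:
            raise ValueError(
                f"cannot set feature gate {name} to {value}, "
                f"feature is locked to {spec.default}"
            )
        self._overrides[name] = value

    def enabled(self, name):
        return self._overrides.get(name, self._known[name].default)


# Assumed GA-era state for this KEP: default on, locked.
gates = FeatureGates({"KubeletPSI": FeatureSpec(default=True, lock_to_default=True)})
```

Under this model, `gates.set("KubeletPSI", True)` is accepted as a no-op, while `gates.set("KubeletPSI", False)` raises, which is the behavior the revised wording ("can no longer be disabled") describes.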
Should we investigate this before GA?
This was talked through on the bug, and I believe not. It can be solved today with Prometheus-level metric filtering, and in the future CRI stats can choose with more granularity whether to expose these metrics at the pod / infra container level. Metrics are expensive, and that expense is generally considered worth it.
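As a rough illustration of the filtering mentioned above: in Prometheus itself this would normally be a `metric_relabel_configs` rule with `action: drop` matching on `__name__`. The Python sketch below only mimics the effect on text-format scrape output. It assumes the PSI series share a `container_pressure_` name prefix; the helper names `metric_name` and `drop_psi` are hypothetical.

```python
import re

# Assumption: PSI series exposed via cAdvisor are named container_pressure_*.
PSI_NAME = re.compile(r"container_pressure_.*")


def metric_name(sample_line: str) -> str:
    """Return the metric name of a sample line (text before '{' or whitespace)."""
    return re.split(r"[{\s]", sample_line, maxsplit=1)[0]


def drop_psi(exposition: str) -> str:
    """Drop PSI samples and their HELP/TYPE lines from text-format metrics."""
    kept = []
    for line in exposition.splitlines():
        if line.startswith("#"):
            parts = line.split()
            # Metadata lines look like "# HELP <name> ..." or "# TYPE <name> <type>".
            is_meta = len(parts) >= 3 and parts[1] in ("HELP", "TYPE")
            name = parts[2] if is_meta else ""
        else:
            name = metric_name(line) if line.strip() else ""
        if name and PSI_NAME.fullmatch(name):
            continue  # equivalent in spirit to a relabel rule with action: drop
        kept.append(line)
    return "\n".join(kept)
```

Filtering at scrape time reduces storage and query cost on the Prometheus side, but it does not remove the collection cost on the node, which is the concern raised in the next comment.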
My main concern is regressions for those who don't implement this. Will we document or support a knob to filter out these pods? Or do we just accept the performance hit and leave it to a future KEP?
haircommander left a comment
On that line, I am feeling ready to approve this PR. I can imagine cases where a user would want to enable PSI but disable the metrics (because of the cost), but we don't currently allow users to disable any default container metrics in cAdvisor. As we move to CRI stats, that opens the avenue for more granularity and the ability to customize which metrics are exposed. Thus, the aforementioned cases can be handled by that, rather than by introducing new kubelet config fields. If customers want to enable PSI, they get these metrics. If they don't want the metrics, they shouldn't enable PSI. There's room to improve in the future.
I am curious whether we feel we have enough real-world data yet, but that will be teased out in the process of actually bumping to stable. Let's make sure users are using it and are happy.
/approve
from SIG node side.
Thanks for the update @mariafromano-25 !
- Quantify the cAdvisor and kubelet-level overhead of PSI metric collection, especially where PSI is disabled at the kernel level.
- Validate with SIG Node that collection overhead is acceptable for general use cases, or include opt-out knobs.
- Expanded stress testing with diverse environments and scenarios, while maintaining acceptable minimal resource consumption, as outlined in Beta perf testing.
- Gather evidence of real-world usage from beta users.
Do we have any data on this yet? Has Google enabled it? OpenShift allows customers to enable it but has not turned it on by default.
Google enables it by default, and there is no way for the user to turn it off at the moment.
Grepping for CONFIG_PSI in /boot/CONFIG-FILE
CONFIG_PSI=y
# CONFIG_PSI_DEFAULT_DISABLED is not set
The custom node system configurations documentation does not mention it either. But the beta performance test report indicated negligible overhead on both the node and kubelet level. I am working on more performance tests to also include the kernel.
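The kernel-side check above can be sketched in Python. The first helper reads PSI build options out of a kernel config dump like the one quoted; the second parses the `some`/`full` lines that the kernel exposes under `/proc/pressure/cpu`, `/proc/pressure/memory`, and `/proc/pressure/io` when PSI is active. The function names are hypothetical, and this is an illustration rather than how the kubelet or cAdvisor reads PSI.

```python
import re


def psi_build_flags(config_text: str) -> dict:
    """Collect CONFIG_PSI* options from a kernel config dump.

    Options reported as "is not set" are recorded as "n".
    """
    flags = {}
    for raw in config_text.splitlines():
        line = raw.strip()
        m = re.fullmatch(r"(CONFIG_PSI\w*)=(\w+)", line)
        if m:
            flags[m.group(1)] = m.group(2)
            continue
        m = re.fullmatch(r"# (CONFIG_PSI\w*) is not set", line)
        if m:
            flags[m.group(1)] = "n"
    return flags


def parse_pressure(text: str) -> dict:
    """Parse a /proc/pressure/<resource> file into {"some": {...}, "full": {...}}."""
    out = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        kind, fields = parts[0], parts[1:]
        # Each field looks like "avg10=0.00" or "total=12345".
        out[kind] = {k: float(v) for k, v in (f.split("=", 1) for f in fields)}
    return out
```

With the config quoted above, `psi_build_flags` would report `CONFIG_PSI` as `y` and `CONFIG_PSI_DEFAULT_DISABLED` as `n`, meaning PSI is compiled in and on by default. Note that the build config alone is not the whole story, since the `psi=` kernel boot parameter can also enable or disable PSI tracking.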
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: haircommander, johnbelamaric, mariafromano-25
This PR updates KEP-4205 to reflect the graduation of the "Expose PSI Metrics" feature to General Availability (GA), targeting the v1.35 release.
The KEP `stage` has been updated to `stable` and the `latest-milestone` is now `v1.35`.
/kind documentation
/sig node
/assign @roycaihw