[KEP 4205] Blog post for PSI metrics GA#54709
Hi @mariafromano-25 👋 v1.36 Communications team here, @ttsuuubasa as author of #54541, I'd like you to be a writing buddy for @mariafromano-25 on this PR. Please:
|
|
Hi @mariafromano-25 👋 -- this is Utkarsh (@UtkarshUmre) from the v1.36 Communications Team! Just a friendly reminder that we are approaching the feature blog "ready for review" deadline: Monday, 6 April. Please have the blog in a non-draft state with the write-up complete, so that the SIG Docs Blog team can start the blog review. If you have any questions or need help, please don't hesitate to reach out to me or any of the Communications Team members. We are here to help you! |
lmktfy left a comment:
This is basically OK but I do have a lot of small recommended fixes.
|
|
||
| Since its original implementation in the Linux kernel in 2018, Pressure Stall Information (PSI) has provided users with the high-fidelity signals needed to identify resource saturation before it becomes an outage. Unlike traditional utilization metrics, PSI tells the story of tasks stalled and time lost, all in nicely-packaged percentages of time across the CPU, memory, and I/O. | ||
|
|
||
| Today, we are excited to announce that Kubelet-integrated PSI metrics have graduated to **General Availability (GA)** in Kubernetes v1.36. This graduation ensures that users across the ecosystem have a stable, reliable interface to observe resource contention at the node, pod, and container levels. |
This is a post release blog article. It is not the announcement; it is a detailed follow up. Please reword accordingly.
Thank you for the callout, I reworded to reference the recent release instead of announcing.
|
|
||
| You can read more about our performance tests [here](https://docs.google.com/document/d/1ffv7pleid3uk0dT9euK5vkso41i9tscNoEVgeshfiCc/edit?usp=sharing&resourcekey=0-_xKPCU4zqiGU0e7Q8jDJRw). | ||
|
|
||
| ### Getting Started |
Do you need to use a particular kernel or does it Just Work?
Yes, in the "Getting Started" section I specified
| 1. **Kernel PSI OFF / Kubelet Feature ON** (Baseline) | ||
| 2. **Kernel PSI ON / Kubelet Feature ON** (Kernel Scheduler overhead) | ||
| 3. **Kernel PSI ON / Kubelet Feature OFF** (Default Baseline) | ||
| 4. **Kernel PSI ON / Kubelet Feature ON** (Feature fallback behavior) |
nit: Why is this called "fallback"? Isn't kernel ON + kubelet ON = the feature?
I was trying to list all the conditions, but I can see how the overlap made it confusing. I reworded to clearly outline both cases, one to isolate kernel overhead and the other for kubelet overhead.
| {{< figure src="/images/node_sys_cpu_usage_rate_comparison.png" alt="A line graph comparing the Node System (Kernel) CPU usage rate over elapsed time with the PSI feature turned OFF versus ON." title="Node System CPU Usage Rate Comparison" >}} | ||
| *Figure 1: Node System CPU comparison under load (80 pods).* | ||
|
|
||
| As seen in Figure 1, the "Kernel Tax" for enabling PSI is remarkably low. Even under heavy I/O and CPU load, the **System CPU** delta between the PSI-enabled (red) and PSI-disabled (blue) clusters remained consistently under **0.2 cores** and over **0.037 cores** for the most part. This confirms that simply enabling the feature does not raise the pre-existing resource use and that the internal kernel bookkeeping for stall tracking is safe for production-scale deployments. |
Reading this, it's not super clear to me which two scenarios we are comparing.
the "Kernel Tax" for enabling PSI is remarkably low
Does "enabling PSI" mean the Kubernetes feature? And is the kernel PSI enabled or disabled?
You're right, it was confusing reading back. I added explanations for both cases with their corresponding graphs.
| {{< figure src="/images/kubelet_cpu_usage_rate_comprison.png" alt="A line graph comparing the Kubelet CPU usage rate over elapsed time with the PSI feature turned OFF versus ON." title="Kubelet CPU Usage Rate Comparison" >}} |
I think this should be the main measurement that we are demonstrating: kubelet overhead when the k8s feature is enabled vs. disabled (while PSI is enabled at kernel level).
| ### Getting Started | ||
| As of v1.36, the `KubeletPSI` feature gate is enabled by default. You can query the Kubelet Summary API to see real-time pressure data: |
We should document the requirements:
- Kernel version
- cgroup v2
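To make the Summary API step concrete, here is a minimal Python sketch that pulls the 10-second pressure averages out of a Summary API response. The JSON fragment is hand-written for illustration, its values are made up, and the field layout (a `psi` object under `cpu`, `memory`, and `io`, with `some` and `full` entries carrying `avg10`, `avg60`, `avg300`, and `total`) is an assumption that should be checked against a real response. `psi_avg10` is a hypothetical helper name.

```python
import json

# Hand-written, illustrative fragment shaped like a Summary API response.
# The values are made up and the field layout is an assumption.
SAMPLE_SUMMARY = json.loads("""
{
  "node": {
    "cpu": {
      "psi": {
        "some": {"total": 123456, "avg10": 1.5, "avg60": 0.8, "avg300": 0.2},
        "full": {"total": 0, "avg10": 0.0, "avg60": 0.0, "avg300": 0.0}
      }
    },
    "memory": {
      "psi": {
        "some": {"total": 2048, "avg10": 0.25, "avg60": 0.1, "avg300": 0.0},
        "full": {"total": 1024, "avg10": 0.1, "avg60": 0.05, "avg300": 0.0}
      }
    }
  }
}
""")


def psi_avg10(stats: dict) -> dict:
    """Return {resource: {"some": avg10, "full": avg10}} for resources reporting PSI."""
    out = {}
    for resource in ("cpu", "memory", "io"):
        # Resources without a "psi" object (or missing entirely) are skipped.
        psi = (stats.get(resource) or {}).get("psi")
        if not psi:
            continue
        out[resource] = {
            kind: psi[kind]["avg10"] for kind in ("some", "full") if kind in psi
        }
    return out
```

Against a live cluster, the same function could be fed the `node` object (or a pod or container entry) from the output of `kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary"`, assuming the response carries the layout above.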
| ### Getting Started |
Can we also mention this improvement that was done in 1.36? kubernetes/kubernetes#137326
Great idea! Just added that as well before the "Getting Started" section.
ttsuuubasa left a comment:
Everything looks fine to me on my end. Congrats on GA!
| To use PSI metrics in your Kubernetes cluster, your nodes must meet the following requirements: | ||
|
|
||
| 1. **Ensure your nodes are running a Linux kernel version 4.20 or later and are using cgroup v2.** | ||
| 2. **Ensure PSI is enabled at the OS level** (your kernel must be compiled with `CONFIG_PSI=y` and must not be booted with the `psi=0` parameter). |
must not be booted with the `psi=0` parameter
What's the default? If the default is 1, this looks good. If the default is 0, shall we just say "must boot with `psi=1`" instead?
Yup, the default is 1!
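The requirements discussed above can be checked on a node with a short script. The following is a rough Python sketch, not an official tool: it assumes the common conventions that a cgroup v2 (unified) hierarchy exposes `/sys/fs/cgroup/cgroup.controllers` and that a kernel with PSI enabled exposes a readable `/proc/pressure/cpu`. The function and key names are hypothetical.

```python
import os
import platform
import re


def kernel_at_least(release: str, major: int, minor: int) -> bool:
    """Compare the leading "major.minor" of a kernel release string."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        return False
    return (int(match.group(1)), int(match.group(2))) >= (major, minor)


def psi_readable(path: str = "/proc/pressure/cpu") -> bool:
    """True if the kernel pressure file can be opened and read."""
    try:
        with open(path) as handle:
            handle.read()
        return True
    except OSError:
        # Missing file or unsupported operation: treat PSI as unavailable.
        return False


def check_psi_prerequisites() -> dict:
    """Report the three requirements as booleans (assumed checks, see lead-in)."""
    return {
        "kernel_4_20_or_later": kernel_at_least(platform.release(), 4, 20),
        "cgroup_v2": os.path.exists("/sys/fs/cgroup/cgroup.controllers"),
        "psi_readable": psi_readable(),
    }
```

Running `check_psi_prerequisites()` on a node returns a dictionary of three booleans; any `False` entry points at the requirement to investigate first.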
| Our testing focused on two primary scenarios to isolate the impact of the Kubelet and kernel-level collection respectively: | ||
| 1. **Kernel PSI ON / Kubelet Feature OFF** vs **Kernel PSI ON / Kubelet Feature ON** (Kubelet overhead) | ||
| 2. **Kernel PSI OFF / Kubelet Feature ON** vs **Kernel PSI ON / Kubelet Feature ON** (Kernel overhead) | ||
|
|
||
| #### Scenario 1: The Kubelet Overhead | ||
| First, we evaluated the Kubelet overhead (Case 1) on 4-core machines. For these tests, the Linux kernel was already tracking pressure on both clusters by default (`psi=1`), but we toggled the `KubeletPSI` feature gate to see whether the Kubelet actively querying and exposing these metrics affected resource usage. As seen in the following graph, the **System CPU** usage line for the Kubelet PSI-enabled cluster (red) follows the same pattern as that of the Kubelet PSI-disabled cluster (blue), with a slight expected increase and delay from the baseline. This shows that once the OS is tracking PSI, at around **2.5 cores**, the act of Kubernetes reading those cgroup metrics has a negligible effect on performance. | ||
|
|
||
| {{< figure src="/images/kubeletPSI_sys_cpu_usage_rate_graph.png" alt="A line graph comparing the system CPU usage rate over elapsed time with the PSI feature turned off versus on and kernel PSI off." title="(Case 1) System CPU Usage Rate Comparison" caption="Figure 1: Node System CPU Usage Rate Comparison." >}} |
You mentioned case 1 is "Kernel PSI ON / Kubelet Feature OFF vs Kernel PSI ON / Kubelet Feature ON", so I expected kernel PSI to always be ON in the comparison. I don't get why the graph is showing (kernel) PSI ON vs. OFF. Did I miss something?
You're correct, kernel PSI stays ON for case 1. The "PSI On/Off" refers to the Kubelet feature gate. Oh, I think I put "off" instead of "on" in the alt field.
Just updated the alt field and the graph line labels to be clearer.
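Scenario 1 describes Kubernetes reading pressure data that the kernel already tracks. As a rough illustration of how light that read is, the sketch below parses the kernel's documented pressure format (the text found in `/proc/pressure/*` and in cgroup v2 `*.pressure` files). The sample text and the `parse_pressure` name are made up for this example, and this is not the Kubelet's actual implementation.

```python
# Illustrative sample in the kernel's pressure-file format; values are made up.
SAMPLE_PRESSURE = (
    "some avg10=0.12 avg60=0.05 avg300=0.01 total=48231\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=1200\n"
)


def parse_pressure(text: str) -> dict:
    """Parse pressure-file text into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        kind, *fields = line.split()
        values = dict(field.split("=", 1) for field in fields)
        result[kind] = {
            "avg10": float(values["avg10"]),    # % of time stalled, last 10 s
            "avg60": float(values["avg60"]),    # % of time stalled, last 60 s
            "avg300": float(values["avg300"]),  # % of time stalled, last 300 s
            "total_us": int(values["total"]),   # cumulative stall time, microseconds
        }
    return result
```

For example, `parse_pressure(open("/proc/pressure/memory").read())` would return the node-level memory pressure on a host where PSI is enabled.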
|
LGTM label has been added. Git tree hash: 0d73854c29d35c9aabec61610dbc564535c22d03 |
|
/assign @nate-double-u |
|
@ttsuuubasa you were suggested as the writing buddy for this PR. Would you be willing to provide a review? |
lmktfy left a comment:
/lgtm
/approve
/hold
for release comms to confirm we can include this one
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: lmktfy, roycaihw
|
/unhold |
Description
This commit adds a new blog post to announce that Pressure Stall Information (PSI) Metrics has graduated to Stable (GA) in Kubernetes v1.36.
Issue
Ref: kubernetes/enhancements#4205
/sig node
Closes: #