[KEP 4205] Blog post for PSI metrics GA#54709

Merged
k8s-ci-robot merged 8 commits into kubernetes:main from mariafromano-25:psi-ga-blog
Apr 27, 2026

Conversation

@mariafromano-25
Contributor

Description

This commit adds a new blog post to announce that Pressure Stall Information (PSI) Metrics has graduated to Stable (GA) in Kubernetes v1.36.

Issue

Ref: kubernetes/enhancements#4205

/sig node
Closes: #

@k8s-ci-robot added the do-not-merge/work-in-progress and sig/node labels on Feb 27, 2026
@k8s-ci-robot added the cncf-cla: yes label on Feb 27, 2026
@k8s-ci-robot added the area/blog, language/en, and size/S labels on Feb 27, 2026

netlify Bot commented Feb 27, 2026

Pull request preview available for checking

Built without sensitive environment variables

Name Link
🔨 Latest commit 1a5659f
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-io-main-staging/deploys/69e7f705fd47920008b63f26
😎 Deploy Preview https://deploy-preview-54709--kubernetes-io-main-staging.netlify.app

@chadmcrowell
Contributor

Hi @mariafromano-25 👋 v1.36 Communications team here,

@ttsuuubasa as author of #54541, I'd like you to be a writing buddy for @mariafromano-25 on this PR.

Please:

  • Review this PR, paying attention to the guidelines and review hints
  • Update your own PR based on any best practices you identify that should be applied
  • Remember to be compassionate with your fellow article author

@UtkarshUmre
Member

Hi @mariafromano-25 👋 -- this is Utkarsh (@UtkarshUmre ) from the v1.36 Communications Team!

Just a friendly reminder that we are approaching the feature blog "ready for review" deadline: Monday, 6 April. Please have the blog in a non-draft state, with the write-up complete, so that the SIG Docs Blog team can start its review.

If you have any questions or need help, please don't hesitate to reach out to me or any of the Communications Team members. We are here to help you!

@k8s-ci-robot added the sig/docs and size/M labels and removed the size/S label on Apr 7, 2026
@mariafromano-25
Contributor Author

cc @roycaihw @ndixita

@mariafromano-25 marked this pull request as ready for review on April 7, 2026 06:55
@k8s-ci-robot requested a review from graz-dev on April 7, 2026 06:56
@mariafromano-25 changed the title from "WIP: Add placeholder blog post for PSI metrics GA" to "[KEP 4205] Blog post for PSI metrics GA" on Apr 7, 2026
@k8s-ci-robot removed the do-not-merge/work-in-progress label on Apr 7, 2026
Member

@lmktfy left a comment

This is basically OK but I do have a lot of small recommended fixes.

Comment thread content/en/blog/_posts/2026-04-22-psi-metrics-ga/index.md Outdated

Since its original implementation in the Linux kernel in 2018, Pressure Stall Information (PSI) has provided users with the high-fidelity signals needed to identify resource saturation before it becomes an outage. Unlike traditional utilization metrics, PSI tells the story of tasks stalled and time lost, all in nicely-packaged percentages of time across the CPU, memory, and I/O.

Today, we are excited to announce that Kubelet-integrated PSI metrics have graduated to **General Availability (GA)** in Kubernetes v1.36. This graduation ensures that users across the ecosystem have a stable, reliable interface to observe resource contention at the node, pod, and container levels.

Member

This is a post release blog article. It is not the announcement; it is a detailed follow up. Please reword accordingly.

Contributor Author

Thank you for the callout, I reworded to reference the recent release instead of announcing.
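
For readers less familiar with the kernel interface behind the excerpt above: each PSI file (for example `/proc/pressure/cpu`, or `cpu.pressure` inside a cgroup v2 directory) holds `some` and `full` lines with `avg10`, `avg60`, and `avg300` percentages plus a cumulative `total` in microseconds. The Python sketch below parses that text format. The function names are illustrative only and are not part of any Kubernetes or kernel API.

```python
from typing import Dict, Tuple


def parse_psi_line(line: str) -> Tuple[str, Dict[str, float]]:
    """Parse one PSI line, e.g. 'some avg10=0.12 avg60=0.05 avg300=0.01 total=12345'."""
    kind, *fields = line.split()
    if kind not in ("some", "full"):
        raise ValueError(f"unexpected PSI line type: {kind!r}")
    stats: Dict[str, float] = {}
    for field in fields:
        # Each field is 'key=value'; avg* are percentages, total is microseconds.
        key, _, value = field.partition("=")
        stats[key] = float(value)
    return kind, stats


def parse_psi(text: str) -> Dict[str, Dict[str, float]]:
    """Parse the full contents of a PSI file into {'some': {...}, 'full': {...}}."""
    return dict(parse_psi_line(line) for line in text.splitlines() if line.strip())
```

On a node with PSI enabled, something like `parse_psi(open("/proc/pressure/memory").read())` would return the current memory pressure averages.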

Comment thread content/en/blog/_posts/2026-04-22-psi-metrics-ga/index.md Outdated

You can read more about our performance tests [here](https://docs.google.com/document/d/1ffv7pleid3uk0dT9euK5vkso41i9tscNoEVgeshfiCc/edit?usp=sharing&resourcekey=0-_xKPCU4zqiGU0e7Q8jDJRw).

### Getting Started

Member

Do you need to use a particular kernel or does it Just Work?

Contributor Author

Yes, in the "Getting Started" section I specified

Comment thread content/en/blog/_posts/2026-04-22-psi-metrics-ga/index.md Outdated
1. **Kernel PSI OFF / Kubelet Feature ON** (Baseline)
2. **Kernel PSI ON / Kubelet Feature ON** (Kernel Scheduler overhead)
3. **Kernel PSI ON / Kubelet Feature OFF** (Default Baseline)
4. **Kernel PSI ON / Kubelet Feature ON** (Feature fallback behavior)

Member

nit: Why is this called "fallback"? Isn't kernel ON + kubelet ON = the feature?

Contributor Author

I was trying to list all of the conditions, but I can see how the overlap made it confusing. I reworded it to clearly outline both cases: one to isolate kernel overhead and the other to isolate Kubelet overhead.

{{< figure src="/images/node_sys_cpu_usage_rate_comparison.png" alt="A line graph comparing the Node System (Kernel) CPU usage rate over elapsed time with the PSI feature turned OFF versus ON." title="Node System CPU Usage Rate Comparison" >}}
*Figure 1: Node System CPU comparison under load (80 pods).*

As seen in Figure 1, the "Kernel Tax" for enabling PSI is remarkably low. Even under heavy I/O and CPU load, the **System CPU** delta between the PSI-enabled (red) and PSI-disabled (blue) clusters remained consistently under **0.2 cores** and over **0.037 cores** for the most part. This confirms that simply enabling the feature does not raise the pre-existing resource use and that the internal kernel bookkeeping for stall tracking is safe for production-scale deployments.

Member

Reading this, it's not super clear to me which two scenarios we are comparing.

the "Kernel Tax" for enabling PSI is remarkably low

Does "enabling PSI" mean the Kubernetes feature? And is the kernel PSI enabled or disabled?

Contributor Author

You're right, it was confusing on rereading. I added explanations for both cases, each with its corresponding graph.
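
To make the "delta" discussed in this thread concrete, here is a minimal Python sketch of how the gap between two CPU usage series (one from a PSI-enabled cluster, one from a PSI-disabled cluster) could be summarized. The function name and any values passed to it are hypothetical. This is not the method used in the linked performance tests, and no numbers from those tests are reproduced here.

```python
from typing import Dict, Sequence


def cpu_delta_summary(enabled: Sequence[float], disabled: Sequence[float]) -> Dict[str, float]:
    """Summarize the per-sample difference (in cores) between two aligned series."""
    if len(enabled) != len(disabled) or not enabled:
        raise ValueError("series must be non-empty and the same length")
    # Positive deltas mean the PSI-enabled cluster used more CPU at that sample.
    deltas = [e - d for e, d in zip(enabled, disabled)]
    return {
        "min": min(deltas),
        "max": max(deltas),
        "mean": sum(deltas) / len(deltas),
    }
```

A claim such as "the delta stayed under 0.2 cores" would then correspond to checking that `max` is below `0.2` for samples taken at matching elapsed times.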

{{< figure src="/images/kubelet_cpu_usage_rate_comprison.png" alt="A line graph comparing the Kubelet CPU usage rate over elapsed time with the PSI feature turned OFF versus ON." title="Kubelet CPU Usage Rate Comparison" >}}

Member

I think this should be the main measurement we demonstrate: kubelet overhead when the k8s feature is enabled vs. disabled (while PSI is enabled at the kernel level).

You can read more about our performance tests [here](https://docs.google.com/document/d/1ffv7pleid3uk0dT9euK5vkso41i9tscNoEVgeshfiCc/edit?usp=sharing&resourcekey=0-_xKPCU4zqiGU0e7Q8jDJRw).

### Getting Started
As of v1.36, the `KubeletPSI` feature gate is enabled by default. You can query the Kubelet Summary API to see real-time pressure data:

Member

Can we also mention this improvement that was done in 1.36? kubernetes/kubernetes#137326

Contributor Author

Great idea! Just added that as well before the "Getting Started" section.
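
The "Getting Started" excerpt quoted in this thread mentions querying the Kubelet Summary API. As a rough sketch, the Python below pulls node-level PSI data out of a Summary API response that has already been fetched, for instance with `kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary"`. The field layout (`node.<resource>.psi` with `some` and `full` entries) is an assumption about the response shape and should be checked against a real response. The function names are illustrative.

```python
import json
from typing import Any, Dict, Optional

# Resources assumed to carry a "psi" block in the node-level stats.
RESOURCES = ("cpu", "memory", "io")


def node_psi(summary: Dict[str, Any]) -> Dict[str, Optional[Dict[str, Any]]]:
    """Return the PSI block per resource, or None where it is absent."""
    node = summary.get("node") or {}
    return {res: (node.get(res) or {}).get("psi") for res in RESOURCES}


def node_psi_from_json(text: str) -> Dict[str, Optional[Dict[str, Any]]]:
    """Convenience wrapper for raw JSON text, such as saved kubectl output."""
    return node_psi(json.loads(text))
```

Returning `None` for a missing block keeps the caller able to tell "no pressure" apart from "PSI data not reported", which matters on nodes that do not meet the kernel requirements.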

Contributor

@ttsuuubasa left a comment

Everything looks fine to me on my end. Congrats on GA!

Comment thread content/en/blog/_posts/2026-04-22-psi-metrics-ga/index.md Outdated
To use PSI metrics in your Kubernetes cluster, your nodes must meet the following requirements:

1. **Ensure your nodes are running a Linux kernel version 4.20 or later and are using cgroup v2.**
2. **Ensure PSI is enabled at the OS level** (your kernel must be compiled with `CONFIG_PSI=y` and must not be booted with the `psi=0` parameter).

Member

must not be booted with the psi=0 parameter

What's the default? If the default is 1, this looks good. If the default is 0, shall we just say it must boot with `psi=1` instead?

Contributor Author

yup default is 1!
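
The node requirements quoted in this thread (kernel 4.20 or later, cgroup v2, and no `psi=0` boot parameter) can be checked with a short script. The Python below is a minimal sketch under stated assumptions: the helper names are hypothetical, and the presence of `/sys/fs/cgroup/cgroup.controllers` is used as a common heuristic for a cgroup v2 unified hierarchy. It does not inspect `CONFIG_PSI`.

```python
import platform
import re
from pathlib import Path
from typing import Dict, Tuple


def kernel_at_least(release: str, minimum: Tuple[int, int] = (4, 20)) -> bool:
    """Compare the major.minor part of a kernel release string to a minimum."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


def psi_disabled_on_cmdline(cmdline: str) -> bool:
    """True if the kernel command line explicitly contains psi=0."""
    return "psi=0" in cmdline.split()


def check_node() -> Dict[str, bool]:
    """Run the checks against the local machine (Linux only)."""
    return {
        "kernel_ok": kernel_at_least(platform.release()),
        # Heuristic: this file exists at the cgroup root on a v2 hierarchy.
        "cgroup_v2": Path("/sys/fs/cgroup/cgroup.controllers").exists(),
        "psi_not_disabled": not psi_disabled_on_cmdline(
            Path("/proc/cmdline").read_text()
        ),
    }
```

Running `check_node()` on a node returns a dictionary of booleans, so a failing requirement is easy to spot.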

Comment on lines +27 to +34
Our testing focused on two primary scenarios to isolate the impact of the Kubelet and kernel-level collection respectively:
1. **Kernel PSI ON / Kubelet Feature OFF** vs **Kernel PSI ON / Kubelet Feature ON** (Kubelet overhead)
2. **Kernel PSI OFF / Kubelet Feature ON** vs **Kernel PSI ON / Kubelet Feature ON** (Kernel overhead)

#### Scenario 1: The Kubelet Overhead
First, we evaluated the Kubelet overhead (Case 1) on 4-core machines. For these tests, the Linux kernel was already tracking pressure on both clusters by default (`psi=1`), and we toggled the `KubeletPSI` feature gate to see whether the Kubelet actively querying and exposing these metrics affected resource usage. As seen in the following graph, the **System CPU** usage line for the Kubelet PSI-enabled cluster (red) follows the same pattern as the line for the Kubelet PSI-disabled cluster (blue), with a slight, expected increase and delay from the baseline. This shows that once the OS is tracking PSI, at around **2.5 cores**, the cost of Kubernetes reading those cgroup metrics is negligible.

{{< figure src="/images/kubeletPSI_sys_cpu_usage_rate_graph.png" alt="A line graph comparing the system CPU usage rate over elapsed time with the PSI feature turned off versus on and kernel PSI off." title="(Case 1) System CPU Usage Rate Comparison" caption="Figure 1: Node System CPU Usage Rate Comparison." >}}

Member

You mentioned case 1 is "Kernel PSI ON / Kubelet Feature OFF vs Kernel PSI ON / Kubelet Feature ON", so I expected kernel PSI to always be ON in the comparison. I don't get why the graph is showing (kernel) PSI ON vs. OFF. Did I miss something?

Contributor Author

You're correct, kernel PSI stays ON for case 1. The "PSI On/Off" refers to the Kubelet feature gate. Oh, I think I put "off" instead of "on" in the alt field.

Contributor Author

Just updated the alt field and the graph line labels to be clearer

Member

@roycaihw left a comment

/lgtm
/approve
Thanks!

@k8s-ci-robot added the lgtm label on Apr 21, 2026
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 0d73854c29d35c9aabec61610dbc564535c22d03

@chadmcrowell
Contributor

/assign @nate-double-u

@lmktfy
Member

lmktfy commented Apr 23, 2026

@ttsuuubasa you were suggested as the writing buddy for this PR.

Would you be willing to provide a review?

Member

@lmktfy left a comment

/lgtm
/approve

/hold
for release comms to confirm we can include this one

@k8s-ci-robot added the do-not-merge/hold label on Apr 26, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lmktfy, roycaihw


@k8s-ci-robot added the approved label on Apr 26, 2026
@SwathiR03
Contributor

/unhold

@k8s-ci-robot removed the do-not-merge/hold label on Apr 27, 2026
@k8s-ci-robot merged commit 8f896ea into kubernetes:main on Apr 27, 2026
6 checks passed
@github-project-automation moved this from Waiting on Author to Done in SIG Node: code and documentation PRs on Apr 27, 2026
Labels

approved, area/blog, cncf-cla: yes, language/en, lgtm, sig/docs, sig/node, size/M

10 participants