Commit 47fac02

polish and added example per feedback
1 parent 64abf21 commit 47fac02

2 files changed

Lines changed: 62 additions & 2 deletions

content/en/docs/concepts/cluster-administration/system-metrics.md

Lines changed: 2 additions & 2 deletions
@@ -256,9 +256,9 @@ This returns the information in a json format as such.
 }
 ```
 
-Here is a simple spike scenario. The `avg10` value of `0.74` indicates that in the last 10 seconds, at least one task in this container was stalled on the CPU for 0.74% of the time (0.0074 seconds or 74 milliseconds). Because `avg10` (0.74) is significantly higher than `avg300` (0.21) on the same resource, this suggests a recent surge in resource contention rather than a sustained long-term bottleneck. If monitored continuously and the `avg300` metrics increase as well, we can diagnose a more serious, lasting issue!
+Here is a simple spike scenario. The `cpu.some` `avg10` value of `0.74` indicates that in the last 10 seconds, at least one task in this container was stalled on the CPU for 0.74% of the time (0.074 seconds, or 74 milliseconds). Because `avg10` (0.74) is significantly higher than `avg300` (0.21) on the same resource, this suggests a recent surge in resource contention rather than a sustained long-term bottleneck. If monitoring continues and the `avg300` metric rises as well, that points to a more serious, lasting issue.
 
-Additionally, notice how in this example `cpu.some` shows pressure, while `cpu.full` remains at 0.00. This tells us that while some processes were delayed waiting for CPU time, the container as a whole was still making progress. A non-zero full value would indicate that all non-idle tasks were stalled simultaneously - a much bigger problem.
+Additionally, notice how in this example `cpu.some` shows pressure, while `cpu.full` remains at 0.00. This tells us that while some processes were delayed waiting for CPU time, the container as a whole was still making progress. A non-zero `full` value would indicate that all non-idle tasks were stalled simultaneously, a much bigger problem.
 Although not as human-readable, the `total` value of 35232438 represents the cumulative stall time in microseconds, that allow latency spike detection that otherwise may not show in the averages. They are also useful for monitoring systems, like Prometheus, to calculate precise rates of increase over specific time windows.
 As a final note, when observing high I/O Pressure alongside low Memory Pressure, it can indicate that the application is waiting on disk throughput rather than failing due to a lack of available RAM. The node is not over-committed on memory, and a different diagnosis for disk consumption can be investigated.
 
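The rate-of-increase idea mentioned above for the `total` counter can be sketched as follows. This is not part of the commit; the sample values are hypothetical, chosen so that two scrapes 60 seconds apart reproduce the example's `avg60` of 0.52.

```python
# Sketch: derive a stall percentage from two samples of the cumulative
# PSI `total` counter (microseconds), taken `window_s` seconds apart.
# This mirrors what a monitoring system like Prometheus does with rate().

def stall_percentage(total_start_us: int, total_end_us: int, window_s: float) -> float:
    """Fraction of the window spent stalled, expressed as a percentage."""
    stalled_us = total_end_us - total_start_us
    return 100.0 * stalled_us / (window_s * 1_000_000)

# Two hypothetical scrapes of cpu.some.total, 60 seconds apart:
pct = stall_percentage(35_232_438, 35_544_438, 60.0)
print(round(pct, 2))  # 312000 us stalled over 60 s -> 0.52
```

Unlike the pre-smoothed `avg` fields, differencing `total` over a window of your choosing exposes short spikes that a 5-minute average would flatten out.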

content/en/docs/reference/instrumentation/understand-psi-metrics.md

Lines changed: 60 additions & 0 deletions
@@ -39,6 +39,66 @@ Pressure Stall Information (PSI) metrics are provided for three resources: CPU,
 
 Each pressure type provides four metrics: `avg10`, `avg60`, `avg300`, and `total`. The `avg` values represent the percentage of wall-clock time that tasks were stalled over 10-second, 60-second, and 5-minute moving averages. The `total` value is a cumulative counter in microseconds showing the total time tasks have been stalled.
 
+Take, for example, the following query against the Summary API:
+
+`kubectl get --raw "/api/v1/nodes/$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')/proxy/stats/summary" | jq '.pods[].containers[] | select(.name=="<CONTAINER_NAME>") | {name, cpu: .cpu.psi, memory: .memory.psi, io: .io.psi}'`
+
+This returns the information in JSON format, as shown here:
+
+```
+{
+  "name": "<CONTAINER_NAME>",
+  "cpu": {
+    "full": {
+      "total": 0,
+      "avg10": 0,
+      "avg60": 0,
+      "avg300": 0
+    },
+    "some": {
+      "total": 35232438,
+      "avg10": 0.74,
+      "avg60": 0.52,
+      "avg300": 0.21
+    }
+  },
+  "memory": {
+    "full": {
+      "total": 539105,
+      "avg10": 0,
+      "avg60": 0,
+      "avg300": 0
+    },
+    "some": {
+      "total": 658164,
+      "avg10": 0.01,
+      "avg60": 0.01,
+      "avg300": 0.00
+    }
+  },
+  "io": {
+    "full": {
+      "total": 33190987,
+      "avg10": 0.31,
+      "avg60": 0.22,
+      "avg300": 0.05
+    },
+    "some": {
+      "total": 40809937,
+      "avg10": 0.52,
+      "avg60": 0.45,
+      "avg300": 0.12
+    }
+  }
+}
+```
+
+Here is a simple spike scenario. The `cpu.some` `avg10` value of `0.74` indicates that in the last 10 seconds, at least one task in this container was stalled on the CPU for 0.74% of the time (0.074 seconds, or 74 milliseconds). Because `avg10` (0.74) is significantly higher than `avg300` (0.21) on the same resource, this suggests a recent surge in resource contention rather than a sustained long-term bottleneck. If monitoring continues and the `avg300` metric rises as well, that points to a more serious, lasting issue.
+
+Additionally, notice how in this example `cpu.some` shows pressure, while `cpu.full` remains at 0.00. This tells us that while some processes were delayed waiting for CPU time, the container as a whole was still making progress. A non-zero `full` value would indicate that all non-idle tasks were stalled simultaneously, a much bigger problem.
+Although not as human-readable, the `total` value of 35232438 represents the cumulative stall time in microseconds, which allows detection of latency spikes that might not show up in the averages.
+
+As a final note, high I/O pressure alongside low memory pressure can indicate that the application is waiting on disk throughput rather than failing due to a lack of available RAM. The node is not over-committed on memory, and disk consumption can be investigated as a separate diagnosis.
+
 ## Example Scenarios
 
 You can use a simple Pod with a stress-testing tool to generate resource pressure and observe the PSI metrics. The following examples use the `agnhost` container image, which includes the `stress` tool.
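The spike heuristic described above (`avg10` well above `avg300` while `full` stays at zero) can be sketched as a small classifier over the example JSON. This is an illustration, not part of the commit: the field names match the Summary API output shown above, but the thresholds (e.g. "spike" when `avg10` exceeds twice `avg300`) are assumptions made for the sketch.

```python
import json

# Illustrative classifier for one resource's PSI block ({"some": ..., "full": ...}).
# Thresholds are assumptions for this sketch, not values from the docs.
def classify(psi: dict) -> str:
    some, full = psi["some"], psi["full"]
    if full["avg10"] > 0:
        # All non-idle tasks were stalled simultaneously in the last 10 s.
        return "full stall: all non-idle tasks stalled recently"
    if some["avg10"] > 2 * some["avg300"] and some["avg10"] > 0:
        # Recent pressure far above the 5-minute average: a spike.
        return "spike: recent contention, not a sustained bottleneck"
    if some["avg300"] > 0:
        return "sustained pressure"
    return "no pressure"

# The cpu block from the example output above:
cpu_psi = json.loads("""{
  "full": {"total": 0, "avg10": 0, "avg60": 0, "avg300": 0},
  "some": {"total": 35232438, "avg10": 0.74, "avg60": 0.52, "avg300": 0.21}
}""")
print(classify(cpu_psi))  # spike: recent contention, not a sustained bottleneck
```

Feeding in the example's memory block instead would report no meaningful pressure, matching the text's reading of the same numbers.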
