Feature Request: Reduce PSI Metric Cardinality to eliminate the impact on Prometheus RSS #136642

@qiliRedHat

Summary
I ran a performance evaluation of enabling PSI at scale. With 500+ test containers on a cluster, enabling PSI causes an RSS increase of up to 1.3+ GB on each Prometheus pod.
I noticed that the "" and "POD" containers are included in the PSI metrics. Dropping them would eliminate approximately 66% of the PSI series.

Analysis
PSI metrics: Kubernetes lets you configure the kubelet to collect Linux kernel Pressure Stall Information (PSI) for CPU, memory, and I/O. The information is collected at the node, pod, and container level and exposed at the /metrics/cadvisor endpoint.

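For each pod, PSI metrics are emitted not only for application containers but also for two additional cgroups. Judging by the id and name labels in the samples below, these are:

  • container="" (pod-level cgroup; the id ends at the pod slice)
  • container="POD" (pause/infra container; the name is k8s_POD_...)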

Here is an example of one PSI metric (container_pressure_cpu_waiting_seconds_total) for a single pod, showing its 3 series:

container_pressure_cpu_waiting_seconds_total{container="",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod43e94c3b_60b2_463c_bb0c_bb10d153e49d.slice",image="",name="",namespace="node-density-heavy-0",pod="perfapp-1-1-bc966c69-h6c77"} 1.456709 1769503672619

container_pressure_cpu_waiting_seconds_total{container="POD",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod43e94c3b_60b2_463c_bb0c_bb10d153e49d.slice/crio-d9ee10c5cdfc43b2bf36c7af5e34cffd4c353e09de52556c06ab98ee25d89310",image="",name="k8s_POD_perfapp-1-1-bc966c69-h6c77_node-density-heavy-0_43e94c3b-60b2-463c-bb0c-bb10d153e49d_0",namespace="node-density-heavy-0",pod="perfapp-1-1-bc966c69-h6c77"} 0 1769503667734

container_pressure_cpu_waiting_seconds_total{container="perfapp",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod43e94c3b_60b2_463c_bb0c_bb10d153e49d.slice/crio-cf24cfb805081bc45e300e9c041d123b78bc97167e43831d83bb1b1c1bfd7609.scope",image="quay.io/cloud-bulldozer/perfapp:latest",name="k8s_perfapp_perfapp-1-1-bc966c69-h6c77_node-density-heavy-0_43e94c3b-60b2-463c-bb0c-bb10d153e49d_0",namespace="node-density-heavy-0",pod="perfapp-1-1-bc966c69-h6c77"} 1.295762 1769503671596
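To check how many PSI series on a node belong to application containers versus the "" and "POD" entries, the scraped text can be grouped by its container label. The following is a minimal sketch, not part of cAdvisor or the kubelet: the helper names parse_labels and is_application_container are hypothetical, and the simple regex assumes label values contain no escaped quotes.

```python
import re

# Matches label pairs such as container="POD"; assumes no escaped quotes in values.
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_labels(line: str) -> dict:
    """Extract the label set from one Prometheus text-format sample line."""
    start, end = line.index("{"), line.rindex("}")
    return dict(LABEL_RE.findall(line[start + 1:end]))

def is_application_container(labels: dict) -> bool:
    """True unless the series carries container="" or container="POD"."""
    return labels.get("container", "") not in ("", "POD")
```

Feeding every container_pressure_* line from /metrics/cadvisor through these two helpers gives a per-node count of application versus non-application PSI series.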

cAdvisor emits PSI metrics for every relevant cgroup. For a pod with a single application container, this means:
3 cgroups (1 application container + "" + "POD") × 6 PSI metric types = 18 PSI series per pod.

  • container_pressure_cpu_stalled_seconds_total
  • container_pressure_cpu_waiting_seconds_total
  • container_pressure_memory_stalled_seconds_total
  • container_pressure_memory_waiting_seconds_total
  • container_pressure_io_stalled_seconds_total
  • container_pressure_io_waiting_seconds_total
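The series count above, and the share that dropping "" and "POD" would remove, can be worked out with a few lines of arithmetic. This is a sketch under the assumption that every cgroup emits all six metrics; the function names are illustrative only.

```python
PSI_METRICS_PER_CGROUP = 6  # the six container_pressure_* metrics listed above
EXTRA_CGROUPS_PER_POD = 2   # container="" and container="POD"

def psi_series_per_pod(app_containers: int) -> int:
    """Total PSI series for a pod with the given number of application containers."""
    return (app_containers + EXTRA_CGROUPS_PER_POD) * PSI_METRICS_PER_CGROUP

def fraction_dropped(app_containers: int) -> float:
    """Share of a pod's PSI series removed by dropping "" and "POD"."""
    dropped = EXTRA_CGROUPS_PER_POD * PSI_METRICS_PER_CGROUP
    return dropped / psi_series_per_pod(app_containers)
```

For a single application container this gives 18 series and a reduction of 12/18, which is the approximately 66% figure. The share shrinks as pods carry more application containers (for example, 12/24 for two), so the saving depends on the workload mix.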

I investigated other container-level metrics and found inconsistent behavior: some include these two additional containers, while others include only a subset of containers.

Configuring Prometheus metric relabeling to drop PSI metrics for non-application containers can help reduce Prometheus resource consumption. However, the kubelet.metricRelabelings field is not among the supported parameters in OpenShift until the 4.20 release.
The OpenShift Cluster Monitoring Operator only exposes a subset of Prometheus configuration parameters. According to the Config Map Reference for the Cluster Monitoring Operator:
"Not all configuration parameters for the monitoring stack are exposed. Only the parameters and fields listed in this reference are supported for configuration."
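The effect of such a relabeling rule can be sketched in Python. Prometheus joins the values of the chosen source labels with ";" and applies a fully anchored regex, so a drop rule over __name__ and container could use a pattern like the one below. The pattern and the should_drop helper are assumptions for illustration, not a tested OpenShift or Prometheus configuration.

```python
import re

# Assumed drop pattern over "<__name__>;<container>": any container_pressure_*
# metric whose container label is empty or exactly "POD".
DROP_RE = re.compile(r"container_pressure_.+;(POD)?")

def should_drop(metric_name: str, labels: dict) -> bool:
    """Mimic a relabel 'drop' action: join source label values, match anchored."""
    joined = ";".join([metric_name, labels.get("container", "")])
    return DROP_RE.fullmatch(joined) is not None
```

With this rule, the container="perfapp" series is kept, the "" and "POD" PSI series are dropped, and non-PSI metrics for the same containers are left untouched.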

Proposal
Configure Kubernetes/cAdvisor/CRI-O to suppress PSI metrics for the pause and pod cgroups (the "" and "POD" containers) at the source.
Dropping the "" and "POD" containers at the source would eliminate approximately 66% of the PSI series.

Open Question

  • Is there a strong reason to avoid dropping the "" and "POD" containers because they are still meaningful for PSI metrics?
  • How does the effort to drop them compare with the additional resource consumption they bring?


    Labels

    needs-triage, sig/node
