CRI sandbox and container stats should be cached to avoid perf issues #7246

@bobbypage

Description

The ListContainerStats and ListPodSandboxStats RPCs, as implemented in the containerd CRI plugin, currently fetch all of the stats every time the RPC is called and do not do any caching.

This can be quite expensive. For ListContainerStats, for example, a metrics request is built:

request, containers, err := c.buildTaskMetricsRequest(in)

Serving it requires going through every single containerd-shim and sending each one a TTRPC request. Each shim then needs to scrape cgroupfs to get the stats, and the data needs to be returned. When there are many containers running at a time, this can cause high CPU load and instability.

ListPodSandboxStats is similar. We need to scrape the sandbox stats:

metrics, err := metricsForSandbox(sandbox)

However, it's even worse, because we also need to get stats for each of the containers within the sandbox:

listContainerStatsRequest := &runtime.ListContainerStatsRequest{Filter: &runtime.ContainerStatsFilter{PodSandboxId: meta.ID}}
resp, err := c.ListContainerStats(ctx, listContainerStatsRequest)

This means that to return the full ListPodSandboxStats response, we need to get metrics for all sandboxes and, for each sandbox, send a TTRPC request to the shim to get container stats.
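To make the cost concrete, here is a rough back-of-the-envelope sketch of how the number of stat fetches grows without caching. The `sandbox` type and `statFetches` function are hypothetical and are not part of containerd; the model simply assumes one fetch per sandbox plus one per container inside it, as described above.

```go
package main

import "fmt"

// sandbox is a hypothetical stand-in for a pod sandbox and its container IDs.
type sandbox struct {
	id         string
	containers []string
}

// statFetches counts the fetches an uncached ListPodSandboxStats-style call
// would perform under the assumed model: one for each sandbox, plus one for
// each container within that sandbox.
func statFetches(sandboxes []sandbox) int {
	fetches := 0
	for _, sb := range sandboxes {
		fetches++                     // sandbox-level metrics
		fetches += len(sb.containers) // per-container metrics
	}
	return fetches
}

func main() {
	pods := []sandbox{
		{id: "pod-a", containers: []string{"app", "sidecar"}},
		{id: "pod-b", containers: []string{"app"}},
	}
	fmt.Println(statFetches(pods)) // prints 5
}
```

Every one of those fetches is repeated on each RPC, so the per-call cost scales with the total number of sandboxes and containers on the node.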

To solve these issues, the recommendation is to fetch the stats for sandboxes and containers periodically in the background and cache them. Then, when an RPC requesting them comes in, we can serve the data from a local in-memory cache. This is similar to what cAdvisor already does today. Additionally, if we collect stats in the background, we should probably avoid collecting the stats for all containers at once, and instead add some jitter and collect them on a periodic interval, e.g. see https://github.com/google/cadvisor/blob/86b11c65eae6682a4c0d1b0ffaaa091aec701e56/manager/container.go#L482-L506

Since these RPCs will become more heavily used as part of kubernetes/enhancements#2371, it's important that they are fast and low overhead. Also see cri-o, which already performs this collection in the background (https://github.com/cri-o/cri-o/blob/main/internal/lib/stats/stats_server.go).

Steps to reproduce the issue

n/a

Describe the results you received and expected

n/a

What version of containerd are you using?

n/a

Any other relevant information

No response

Show configuration if it is related to CRI plugin.

n/a
