Description
ListContainerStats and ListPodSandboxStats, as implemented in containerd CRI, currently fetch all of the stats when the RPC is called and do not do any caching.
This can be quite expensive. For example, for ListContainerStats, a metrics request
    // containerd/pkg/cri/server/container_stats_list.go, line 38 in 1e6523f
    request, containers, err := c.buildTaskMetricsRequest(in)
is built, which requires going through every single containerd-shim and sending a TTRPC request; each shim then needs to scrape cgroupfs to get the stats and return the data. When there are many containers running at a time, this can cause high CPU load and instability.
ListPodSandboxStats is similar. We need to scrape the sandbox stats:
    // containerd/pkg/cri/server/sandbox_stats_list.go, line 36 in 1e6523f
    metrics, err := metricsForSandbox(sandbox)
However, it's even worse, because we also need to get stats for each of the containers within the sandbox:
    // containerd/pkg/cri/server/sandbox_stats_linux.go, lines 113 to 114 in 1e6523f
    listContainerStatsRequest := &runtime.ListContainerStatsRequest{Filter: &runtime.ContainerStatsFilter{PodSandboxId: meta.ID}}
    resp, err := c.ListContainerStats(ctx, listContainerStatsRequest)
This means that to return the full ListPodSandboxStats response, we need to get metrics for all sandboxes and, for each sandbox, send a TTRPC request to the shim to get container stats.
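To make the fan-out concrete, here is a rough sketch of how the number of scrapes grows for a single uncached ListPodSandboxStats call. It is a simplified model, not containerd code: the `Sandbox` type and `onDemandScrapes` function are hypothetical, and the model assumes one scrape per sandbox plus one per container.

```go
package main

import "fmt"

// Sandbox is a hypothetical, simplified view of a pod sandbox.
type Sandbox struct {
	ID         string
	Containers []string
}

// onDemandScrapes counts the scrapes one uncached call would trigger under
// this simplified model: one for each sandbox plus one for each container.
func onDemandScrapes(sandboxes []Sandbox) int {
	total := 0
	for _, sb := range sandboxes {
		total++                     // sandbox-level metrics
		total += len(sb.Containers) // per-container metrics requested from the shim
	}
	return total
}

func main() {
	// Assumed example load: 100 pods with 3 containers each.
	var sandboxes []Sandbox
	for i := 0; i < 100; i++ {
		sandboxes = append(sandboxes, Sandbox{
			ID:         fmt.Sprintf("pod-%d", i),
			Containers: []string{"a", "b", "c"},
		})
	}
	fmt.Println("scrapes per RPC:", onDemandScrapes(sandboxes))
}
```

Under this model the cost is paid again on every call, so it scales with both the number of containers and how often the RPC is polled.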
The recommendation to solve these issues is to fetch the stats for sandboxes and containers periodically in the background and cache them. Then, when an RPC requesting them comes in, we can serve the data from a local in-memory cache. This is similar to what cAdvisor already does today. Additionally, if we collect stats in the background, we should probably avoid collecting the stats for all containers at once, and instead add some jitter and collect them on a periodic interval, e.g. see https://github.com/google/cadvisor/blob/86b11c65eae6682a4c0d1b0ffaaa091aec701e56/manager/container.go#L482-L506
Since these RPCs will become more widely used as part of kubernetes/enhancements#2371, it's important that they are fast and low-overhead. Also see cri-o, which already performs this collection in the background (https://github.com/cri-o/cri-o/blob/main/internal/lib/stats/stats_server.go).
Steps to reproduce the issue
n/a
Describe the results you received and expected
n/a
What version of containerd are you using?
n/a
Any other relevant information
No response
Show configuration if it is related to CRI plugin.
n/a