Skip to content

Set timeout when collecting metrics from shim's Stat#6781

Merged
estesp merged 1 commit intocontainerd:mainfrom
phanhuyn:metrics-collector-stats-timeout
Apr 19, 2022
Merged

Set timeout when collecting metrics from shim's Stat#6781
estesp merged 1 commit intocontainerd:mainfrom
phanhuyn:metrics-collector-stats-timeout

Conversation

@phanhuyn
Copy link
Copy Markdown
Contributor

@phanhuyn phanhuyn commented Apr 6, 2022

Issue

  • When containerd collects metrics from shim's Stat API, there is no timeout set:

ctx := namespaces.WithNamespace(context.Background(), t.Namespace())

ctx := namespaces.WithNamespace(context.Background(), t.Namespace())

  • If the shim process is not responsive, the request blocks there, Collect.Collect() blocks.

    func (c *Collector) Collect(ch chan<- prometheus.Metric) {

  • This causes leaked goroutines and memory. Requests to prometheus metrics /v1/metrics to hangs when there is a "bad shim"

Reproducing

  • Launch containerd with "metrics" and "debug" enable
cat /etc/containerd/config.toml
[metrics]
    address = ":8080"
[debug]
    address = "/run/containerd/debug.sock"
  • I launched a container with a mock shim implementation that hangs on Stats: phanhuyn@a6add78

  • Observe request to prometheus endpoint hangs

curl localhost:8080/v1/metrics
# hangs
  • Goroutines dump of containerd daemon shows hangs in the Stats call to the shim process
ctr pprof goroutines > /tmp/containerd_goroutines

grep "/metrics.go" /tmp/containerd_goroutines
        /go/src/github.com/containerd/containerd/metrics/cgroups/v1/metrics.go:117 +0x233
        /go/src/github.com/containerd/containerd/metrics/cgroups/v1/metrics.go:126 +0x118
        /go/src/github.com/containerd/containerd/metrics/cgroups/v1/metrics.go:104 +0xb1

Fix

  • Send a request to shim with timeout. Default to 2 seconds.
  • I repeat the same steps to verify. After the fix, containerd daemon returns Prometheus metrics even when one of the shim hangs, containerd logs show a request timeout to stat:
Apr 06 09:00:04 ubuntu-xenial containerd[25364]: time="2022-04-06T09:00:04.189993886Z" level=error msg="stat task test1" error="context deadline exceeded: unknown"

Timeout setting

The default timeout is set to 2 seconds. Can be configured by:

[timeouts]
    "io.containerd.timeout.metrics.shimstats" = "10s"

Signed-off-by: Nguyen Phan Huy [email protected]

@k8s-ci-robot
Copy link
Copy Markdown

Hi @phanhuy1502. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@phanhuyn phanhuyn force-pushed the metrics-collector-stats-timeout branch 2 times, most recently from e7bb2b8 to f7cc385 Compare April 6, 2022 10:09
@phanhuyn
Copy link
Copy Markdown
Contributor Author

phanhuyn commented Apr 6, 2022

The goroutine dump for references.
containerd_goroutines.txt

@theopenlab-ci
Copy link
Copy Markdown

theopenlab-ci Bot commented Apr 6, 2022

Build succeeded.

@mikebrow
Copy link
Copy Markdown
Member

mikebrow commented Apr 6, 2022

/ok-to-test

Copy link
Copy Markdown
Member

@fuweid fuweid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left comment about fixed timeout

Comment thread metrics/metrics.go Outdated
@phanhuyn phanhuyn marked this pull request as draft April 7, 2022 02:35
@phanhuyn phanhuyn force-pushed the metrics-collector-stats-timeout branch from f7cc385 to 58fdbb1 Compare April 7, 2022 03:54
@theopenlab-ci
Copy link
Copy Markdown

theopenlab-ci Bot commented Apr 7, 2022

Build succeeded.

@phanhuyn phanhuyn marked this pull request as ready for review April 7, 2022 05:27
@phanhuyn
Copy link
Copy Markdown
Contributor Author

Left comment about fixed timeout

I've updated to use the timeout package

@samuelkarp
Copy link
Copy Markdown
Member

Would you mind rebasing and resolving the merge conflicts?

@phanhuyn phanhuyn force-pushed the metrics-collector-stats-timeout branch from 58fdbb1 to c525aa5 Compare April 12, 2022 04:02
@phanhuyn
Copy link
Copy Markdown
Contributor Author

Would you mind rebasing and resolving the merge conflicts?

updated, thanks

@theopenlab-ci
Copy link
Copy Markdown

theopenlab-ci Bot commented Apr 12, 2022

Build succeeded.

Copy link
Copy Markdown
Member

@samuelkarp samuelkarp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Member

@estesp estesp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@estesp estesp merged commit 45b8689 into containerd:main Apr 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants