Skip to content

[cri] podSandboxStats should return errdefs.ErrUnavailable without wrapping#9897

Closed
dims wants to merge 1 commit intocontainerd:mainfrom
dims:cri-podSandboxStats-should-return-errdefs.ErrUnavailable-without-wrapping
Closed

[cri] podSandboxStats should return errdefs.ErrUnavailable without wrapping#9897
dims wants to merge 1 commit intocontainerd:mainfrom
dims:cri-podSandboxStats-should-return-errdefs.ErrUnavailable-without-wrapping

Conversation

@dims
Copy link
Copy Markdown
Member

@dims dims commented Feb 28, 2024

kubelet is calling CRI's ListPodSandboxStats, when this method is looping through sandbox(es), it calls podSandboxStats and expects to get info about sandboxes that are StateReady.

ListPodSandboxStats expects errdefs.ErrUnavailable when sandbox is NOT StateReady. But in the current implementation, the ErrUnavailable is wrapped into another one using fmt.Errorf so the ListPodSandboxStats fails fast without skipping the sandboxes that are not ready.

from: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-alpha-enabled-default/1762909002125021184
test: Kubernetes e2e suite: [It] [sig-node] NodeProblemDetector [NodeFeature:NodeProblemDetector] should run without error
log snippet:

STEP: Gather node-problem-detector cpu and memory stats - k8s.io/kubernetes/test/e2e/node/node_problem_detector.go:192 @ 02/28/24 17:19:35.031
I0228 17:19:36.098816 10590 node_problem_detector.go:380] Unexpected error: 
    <*errors.StatusError | 0xc004515040>: 
    an error on the server ("Internal Error: failed to list pod stats: rpc error: code = NotFound desc = failed to decode sandbox container metrics for sandbox \"2c95299f99fa7b8a03e66be60d2a4ddd51bf2501d08d2bcf7f6ba8203f57c6de\": no running task found: task bc6c3f30a0a61641cbe56bdbfb867e195b463f92609505972ee2880dbe4826d8 not found: not found") has prevented the request from succeeding (get nodes bootstrap-e2e-minion-group-1526:10250)
    {
        ErrStatus: 
            code: 500
            details:
              causes:
              - message: 'Internal Error: failed to list pod stats: rpc error: code = NotFound
                  desc = failed to decode sandbox container metrics for sandbox "2c95299f99fa7b8a03e66be60d2a4ddd51bf2501d08d2bcf7f6ba8203f57c6de":
                  no running task found: task bc6c3f30a0a61641cbe56bdbfb867e195b463f92609505972ee2880dbe4826d8
                  not found: not found'
                reason: UnexpectedServerResponse
              kind: nodes
              name: bootstrap-e2e-minion-group-1526:10250
            message: 'an error on the server ("Internal Error: failed to list pod stats: rpc error:
              code = NotFound desc = failed to decode sandbox container metrics for sandbox \"2c95299f99fa7b8a03e66be60d2a4ddd51bf2501d08d2bcf7f6ba8203f57c6de\":
              no running task found: task bc6c3f30a0a61641cbe56bdbfb867e195b463f92609505972ee2880dbe4826d8
              not found: not found") has prevented the request from succeeding (get nodes bootstrap-e2e-minion-group-1526:10250)'
            metadata: {}
            reason: InternalError
            status: Failure,
    }
[FAILED] an error on the server ("Internal Error: failed to list pod stats: rpc error: code = NotFound desc = failed to decode sandbox container metrics for sandbox \"2c95299f99fa7b8a03e66be60d2a4ddd51bf2501d08d2bcf7f6ba8203f57c6de\": no running task found: task bc6c3f30a0a61641cbe56bdbfb867e195b463f92609505972ee2880dbe4826d8 not found: not found") has prevented the request from succeeding (get nodes bootstrap-e2e-minion-group-1526:10250)

@dmcgowan
Copy link
Copy Markdown
Member

The bug here seems to be on the check side rather than generation side. The wrapping is done correctly, somewhere isn't unwrapping though like it should.

@dims
Copy link
Copy Markdown
Member Author

dims commented Feb 28, 2024

/retest

3 similar comments
@dims
Copy link
Copy Markdown
Member Author

dims commented Feb 29, 2024

/retest

@dims
Copy link
Copy Markdown
Member Author

dims commented Feb 29, 2024

/retest

@dims
Copy link
Copy Markdown
Member Author

dims commented Feb 29, 2024

/retest


if sandbox.Status.Get().State != sandboxstore.StateReady {
return nil, fmt.Errorf("failed to get pod sandbox stats since sandbox container %q is not in ready state: %w", meta.ID, errdefs.ErrUnavailable)
log.G(ctx).WithField("podsandboxid", sandbox.ID).Debugf("failed to get pod sandbox stats since sandbox container %q is not in ready state", meta.ID)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it client side issue? I mean that we use fmt.Errorf(%w) to mark root cause as ErrUnavailable. The client side should unwrap error.

Copy link
Copy Markdown
Member Author

@dims dims Feb 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fuweid not a client side issue. please see where this is called from within containerd. (wrote a long PR text to describe why this is needed)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I didn't follow the description.

for _, sandbox := range sandboxes {
sandboxStats, err := c.podSandboxStats(ctx, sandbox)
switch {
case errdefs.IsUnavailable(err):
log.G(ctx).WithField("podsandboxid", sandbox.ID).Debugf("failed to get pod sandbox stats, this is likely a transient error: %v", err)
case err != nil:
errs = append(errs, fmt.Errorf("failed to decode sandbox container metrics for sandbox %q: %w", sandbox.ID, err))
default:
podSandboxStats.Stats = append(podSandboxStats.Stats, sandboxStats)
}
}

Checked the main branch and we did expect err to unavailable.

And the error looks weird because it doesn't have unavailable information. Do you know how to check what containerd version used in CI?

: failed to get pod sandbox stats since sandbox container \"c821a1cfa253eca57db68a1ae85184359d1422c7595982e32f47addeb8d8a78e\" is not in ready state"

f914edf

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

containerd://1.7.13

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fuweid yes originally i saw it in 1.7.13, but inspecting the code in main i see the issue can be here as well. right?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to fix it first here and then cherry pick back to 1.7 branch

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. main doesn't have the issue you mentioned. please provide the log. thanks

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the ErrUnabailable error, we did use errors.Is to check. That's why the issue has been fixed.

Based on your description, the error is like

no running task found

I think we should skip task not found error instead of unavailable.

Copy link
Copy Markdown
Member

@fuweid fuweid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's v1.7 bug instead of main

#8671

@dims
Copy link
Copy Markdown
Member Author

dims commented Mar 1, 2024

ok i had forgotten the unwrap/compare!! thanks

https://go.dev/play/p/ThFh6qN5ait?v=gotip

true
true

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants