[cri] ListPodSandboxStats should skip sandboxes whose tasks are missing #10013

@dims

Description

We have a Kubernetes CI job (https://testgrid.k8s.io/google-gce#gci-gce-alpha-enabled-default&width=20) where we are consistently seeing errors from `ListPodSandboxStats`.

In https://github.com/containerd/containerd/pull/9905/files we handled the condition where we were getting `errdefs.ErrUnavailable`. I think we need to handle this case as well:

    if errdefs.IsNotFound(err) {
        return nil, fmt.Errorf("no running task found: %w", err)
    }

STEP: Gather node-problem-detector cpu and memory stats - k8s.io/kubernetes/test/e2e/node/node_problem_detector.go:192 @ 03/28/24 11:47:15.168
I0328 11:47:20.969542 10424 node_problem_detector.go:380] Unexpected error: 
    <*errors.StatusError | 0xc00280a280>: 
    an error on the server ("Internal Error: failed to list pod stats: rpc error: code = NotFound desc = 2 errors occurred:\n\t* failed to decode sandbox container metrics for sandbox \"44c8a7812bfbbd43c9607c017f77db5dd976d774d14d9881d1d7c63f8c3e76fd\": no running task found: task a1416456bfb47e71f5446700fef5f24b7fe31f965017df25ce85a4d37108af82 not found: not found\n\t* failed to decode sandbox container metrics for sandbox \"61f4e0d6e4b3d1ee296908ebeaf8a2668c2e6b5e895906b277b945ea40fa5393\": no running task found: task 1b3c578821277faf98e6355697309c7db529daeb7e42e51bf8355b366a95449d not found: not found") has prevented the request from succeeding (get nodes bootstrap-e2e-minion-group-3q6b:10250)
    {
        ErrStatus: 
            code: 500
            details:
              causes:
              - message: "Internal Error: failed to list pod stats: rpc error: code = NotFound
                  desc = 2 errors occurred:\n\t* failed to decode sandbox container metrics for
                  sandbox \"44c8a7812bfbbd43c9607c017f77db5dd976d774d14d9881d1d7c63f8c3e76fd\":
                  no running task found: task a1416456bfb47e71f5446700fef5f24b7fe31f965017df25ce85a4d37108af82
                  not found: not found\n\t* failed to decode sandbox container metrics for sandbox
                  \"61f4e0d6e4b3d1ee296908ebeaf8a2668c2e6b5e895906b277b945ea40fa5393\": no running
                  task found: task 1b3c578821277faf98e6355697309c7db529daeb7e42e51bf8355b366a95449d
                  not found: not found"
                reason: UnexpectedServerResponse
              kind: nodes
              name: bootstrap-e2e-minion-group-3q6b:10250
            message: 'an error on the server ("Internal Error: failed to list pod stats: rpc error:
              code = NotFound desc = 2 errors occurred:\n\t* failed to decode sandbox container
              metrics for sandbox \"44c8a7812bfbbd43c9607c017f77db5dd976d774d14d9881d1d7c63f8c3e76fd\":
              no running task found: task a1416456bfb47e71f5446700fef5f24b7fe31f965017df25ce85a4d37108af82
              not found: not found\n\t* failed to decode sandbox container metrics for sandbox
              \"61f4e0d6e4b3d1ee296908ebeaf8a2668c2e6b5e895906b277b945ea40fa5393\": no running
              task found: task 1b3c578821277faf98e6355697309c7db529daeb7e42e51bf8355b366a95449d
              not found: not found") has prevented the request from succeeding (get nodes bootstrap-e2e-minion-group-3q6b:10250)'
            metadata: {}
            reason: InternalError
            status: Failure,
    }
[FAILED] an error on the server ("Internal Error: failed to list pod stats: rpc error: code = NotFound desc = 2 errors occurred:\n\t* failed to decode sandbox container metrics for sandbox \"44c8a7812bfbbd43c9607c017f77db5dd976d774d14d9881d1d7c63f8c3e76fd\": no running task found: task a1416456bfb47e71f5446700fef5f24b7fe31f965017df25ce85a4d37108af82 not found: not found\n\t* failed to decode sandbox container metrics for sandbox \"61f4e0d6e4b3d1ee296908ebeaf8a2668c2e6b5e895906b277b945ea40fa5393\": no running task found: task 1b3c578821277faf98e6355697309c7db529daeb7e42e51bf8355b366a95449d not found: not found") has prevented the request from succeeding (get nodes bootstrap-e2e-minion-group-3q6b:10250)
In [It] at: k8s.io/kubernetes/test/e2e/node/node_problem_detector.go:380 @ 03/28/24 11:47:20.969

### Steps to reproduce the issue

The CI jobs can be modified to run with newer versions of containerd.

### Describe the results you received and expected

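Received: `ListPodSandboxStats` fails the whole request with `NotFound` when the task for any sandbox is missing.

Expected: `ListPodSandboxStats` should succeed with whatever pods it can process.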

### What version of containerd are you using?

1.7.14

### Any other relevant information

containerd version 1.7.14

### Show configuration if it is related to CRI plugin.

not applicable
