Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add check for CNI plugins before tearing down pod network #10744

Merged
merged 1 commit into from
Sep 30, 2024

Conversation

sameersaeed
Copy link
Contributor

Currently, locking behavior occurs if a user creates a sandbox pod and then tries to remove / delete it without having any CNI plugins initialized on their system.

Sample flow

pod-config.json:, per crictl docs

{
  "metadata": {
    "name": "nginx-sandbox",
    "namespace": "default",
    "attempt": 1,
    "uid": "65fb5845de97d8a4"
  },
  "log_directory": "/tmp",
  "linux": {
  }
}

Running sandbox pod - receive CNI plugin error:

root@3cd5ffda91af:~# crictl runp pod-config.json 
WARN[0000] runtime connect using default endpoints: [unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead. 
E0922 01:07:22.941363     315 log.go:32] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to setup network for sandbox \"c4c57d078fc28ae83f4fd22167a23aa62e5ca77a09e9e1e62de72bc0135031ef\": cni plugin not initialized"
FATA[0003] run pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "c4c57d078fc28ae83f4fd22167a23aa62e5ca77a09e9e1e62de72bc0135031ef": cni plugin not initialized 

Pod shows up as NotReady:

root@3cd5ffda91af:~# crictl pods
WARN[0000] runtime connect using default endpoints: [unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead. 
POD ID              CREATED             STATE               NAME                NAMESPACE           ATTEMPT             RUNTIME
c4c57d078fc28       270 years ago       NotReady            nginx-sandbox       default             1                   (default)

Locking behavior - cannot stop / remove pod due to CNI plugin error:

root@3cd5ffda91af:~# crictl stopp c4
WARN[0000] runtime connect using default endpoints: [unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead. 
E0922 01:07:43.024652     335 log.go:32] "StopPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to destroy network for sandbox \"c4c57d078fc28ae83f4fd22167a23aa62e5ca77a09e9e1e62de72bc0135031ef\": cni plugin not initialized" podSandboxID="c4"
FATA[0000] stopping the pod sandbox "c4": rpc error: code = Unknown desc = failed to destroy network for sandbox "c4c57d078fc28ae83f4fd22167a23aa62e5ca77a09e9e1e62de72bc0135031ef": cni plugin not initialized 
root@3cd5ffda91af:~# crictl rmp c4
WARN[0000] runtime connect using default endpoints: [unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead. 
E0922 01:07:48.578086     345 log.go:32] "RemovePodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to forcibly stop sandbox \"c4c57d078fc28ae83f4fd22167a23aa62e5ca77a09e9e1e62de72bc0135031ef\": failed to destroy network for sandbox \"c4c57d078fc28ae83f4fd22167a23aa62e5ca77a09e9e1e62de72bc0135031ef\": cni plugin not initialized" podSandboxID="c4"
removing the pod sandbox "c4": rpc error: code = Unknown desc = failed to forcibly stop sandbox "c4c57d078fc28ae83f4fd22167a23aa62e5ca77a09e9e1e62de72bc0135031ef": failed to destroy network for sandbox "c4c57d078fc28ae83f4fd22167a23aa62e5ca77a09e9e1e62de72bc0135031ef": cni plugin not initialized

Changes made

Current code:

if err := c.teardownPodNetwork(ctx, sandbox); err != nil {
return fmt.Errorf("failed to destroy network for sandbox %q: %w", id, err)
}

Added a minor change to the above sandbox_stop.go code so that the pod network teardown will only be performed if CNIResult returns a valid value:

if sandbox.CNIResult != nil {
	if err := c.teardownPodNetwork(ctx, sandbox); err != nil {
		return fmt.Errorf("failed to destroy network for sandbox %q: %w", id, err)
	}	
}

This avoids the locking behavior, so the pods are able to be stopped and / or removed as expected:

root@03195a9aecb5:/src# crictl rmp c4
DEBU[0000] get runtime connection                       
Stopped sandbox c4
root@03195a9aecb5:/src# crictl rmp c4
DEBU[0000] get runtime connection                       
Removed sandbox c4
root@03195a9aecb5:/src# crictl pods
DEBU[0000] get runtime connection                       
DEBU[0000] ListPodSandboxRequest: &ListPodSandboxRequest{Filter:&PodSandboxFilter{Id:,State:nil,LabelSelector:map[string]string{},},} 
DEBU[0000] ListPodSandboxResponse: []                   
POD ID              CREATED             STATE               NAME                NAMESPACE           ATTEMPT             RUNTIME
``

@k8s-ci-robot
Copy link

Hi @sameersaeed. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@dosubot dosubot bot added area/cri Container Runtime Interface (CRI) kind/bug labels Sep 27, 2024
@mxpv
Copy link
Member

mxpv commented Sep 27, 2024

/ok-to-test

Copy link
Member

@mikebrow mikebrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mikebrow mikebrow added the cherry-pick/1.7.x Change to be cherry picked to release/1.7 branch label Sep 30, 2024
@mikebrow
Copy link
Member

/test pull-containerd-k8s-e2e-ec2

@mikebrow
Copy link
Member

mikebrow commented Sep 30, 2024

Additionally, the following needs to be fixed in a follow up PR as the defer'd tear down will also fail and "block" the netns removal https://github.com/containerd/containerd/blob/release/1.6/pkg/cri/server/sandbox_run.go#L312. @fuweid FYI.. probably should not be deferring tear down before the network bring up, or if we do should ignore teardown failure when setup fails. Same for 1.6/7

@mikebrow mikebrow added the cherry-pick/1.6.x Change to be cherry picked to release/1.6 branch label Sep 30, 2024
@mikebrow mikebrow added this pull request to the merge queue Sep 30, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Sep 30, 2024
@mxpv mxpv added this pull request to the merge queue Sep 30, 2024
Merged via the queue into containerd:main with commit 03db11c Sep 30, 2024
53 checks passed
@austinvazquez austinvazquez added cherry-picked/1.6.x PR commits are cherry-picked into release/1.6 branch cherry-picked/1.7.x PR commits are cherry-picked into release/1.7 branch and removed cherry-pick/1.6.x Change to be cherry picked to release/1.6 branch cherry-pick/1.7.x Change to be cherry picked to release/1.7 branch labels Nov 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cri Container Runtime Interface (CRI) cherry-picked/1.6.x PR commits are cherry-picked into release/1.6 branch cherry-picked/1.7.x PR commits are cherry-picked into release/1.7 branch kind/bug ok-to-test size/XS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants