Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[release/1.7] Handle teardown failure to avoid blocking cleanup #10770

Open
wants to merge 1 commit into
base: release/1.7
Choose a base branch
from

Conversation

matteopulcini6
Copy link

@matteopulcini6 matteopulcini6 commented Oct 4, 2024

Follow up to #10744 (comment)

MIke: The scenario here is CNI is not configured / installed .. there are no plugins.. cni setup fails.. returns error and CNIResult is nil... While processing the defer chain to clean up the partially created pod.. cni tear down will also fail because.. Subsequently netns is not destroyed and then because netns is not destroyed the pod is created in the not cleaned up state. The user must manually attempt stop at this point. In the case where cni is not configured, we should not require it to be configured to clean up.

@k8s-ci-robot
Copy link

Hi @matteopulcini6. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@dosubot dosubot bot added the area/cri Container Runtime Interface (CRI) label Oct 4, 2024
@matteopulcini6 matteopulcini6 force-pushed the sandbox-deferring-teardown branch from 7e5a8b8 to 50dc942 Compare October 4, 2024 15:20
Copy link
Member

@mikebrow mikebrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking good need to also put this new block down on line 451 in the user ns networking block

@mikebrow
Copy link
Member

mikebrow commented Oct 4, 2024

/ok-to-test

Copy link
Member

@mikebrow mikebrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM on green

@mikebrow mikebrow added cherry-pick/1.6.x Change to be cherry picked to release/1.6 branch cherry-pick/1.7.x Change to be cherry picked to release/1.7 branch labels Oct 4, 2024
@matteopulcini6
Copy link
Author

/retest

@mikebrow
Copy link
Member

mikebrow commented Oct 4, 2024

failures are not related to PR.. looks like the container registry is blocking.. will try again later.

@sameersaeed
Copy link
Contributor

Sample flow

pod-config.json:, per crictl docs

{
  "metadata": {
    "name": "nginx-sandbox",
    "namespace": "default",
    "attempt": 1,
    "uid": "65fb5845de97d8a4"
  },
  "log_directory": "/tmp",
  "linux": {
  }
}

Current behaviour from running sandbox pod - receive CNI plugin error, but pod still comes up:

root@32748c76b2ba:/src# crictl pods
POD ID              CREATED             STATE               NAME                NAMESPACE           ATTEMPT             RUNTIME
root@32748c76b2ba:/src# crictl runp pod-config.json
E1004 17:42:14.071632   31488 remote_runtime.go:176] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to setup network for sandbox \"803615d3f284991954f660a4d85295c15839ba8fa574640d58dd02de6a248d80\": cni plugin not initialized"
FATA[0000] run pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "803615d3f284991954f660a4d85295c15839ba8fa574640d58dd02de6a248d80": cni plugin not initialized 
root@32748c76b2ba:/src# crictl pods
POD ID              CREATED             STATE               NAME                NAMESPACE           ATTEMPT             RUNTIME
803615d3f2849       1 second ago        NotReady            nginx-sandbox       default             1                   (default)

Changes made

Current code (release 1.7):

// Teardown network if an error is returned.
if cleanupErr = c.teardownPodNetwork(deferCtx, sandbox); cleanupErr != nil {
log.G(ctx).WithError(cleanupErr).Errorf("Failed to destroy network for sandbox %q", id)
}

// Teardown network if an error is returned.
if cleanupErr = c.teardownPodNetwork(deferCtx, sandbox); cleanupErr != nil {
log.G(ctx).WithError(cleanupErr).Errorf("Failed to destroy network for sandbox %q", id)
}

Changes were added to the above sandbox_run.go code to skip the pod network teardown if no CNI plugins are initialized. This allows for the rest of the deferred cleanup steps to run so that the CNI plugins failure does not block the pod from being fully cleaned up as expected:

// Teardown network if an error is returned.
if cleanupErr = c.teardownPodNetwork(deferCtx, sandbox); cleanupErr != nil {
	log.G(ctx).WithError(cleanupErr).Errorf("Failed to destroy network for sandbox %q", id)

	// ignoring failed to destroy networks when we failed to setup networks
	if sandbox.CNIResult == nil {
		cleanupErr = nil
	}
}
// Teardown network if an error is returned.
if cleanupErr = c.teardownPodNetwork(deferCtx, sandbox); cleanupErr != nil {
	log.G(ctx).WithError(cleanupErr).Errorf("Failed to destroy network for sandbox %q", id)

	// ignoring failed to destroy networks when we failed to setup networks
	if sandbox.CNIResult == nil {
		cleanupErr = nil
	}
}

This makes it so that a pod cannot be started if no CNI plugins are initialized:

root@32748c76b2ba:/src# crictl pods
POD ID              CREATED             STATE               NAME                NAMESPACE           ATTEMPT             RUNTIME
root@32748c76b2ba:/src# crictl runp pod-config.json
E1004 17:40:26.373297   29655 remote_runtime.go:176] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to setup network for sandbox \"bd9a6545dc2f9f1484b36a3b72b2c87d2e44b9f74fe5ffc90f53fa220d73462b\": cni plugin not initialized"
FATA[0000] run pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "bd9a6545dc2f9f1484b36a3b72b2c87d2e44b9f74fe5ffc90f53fa220d73462b": cni plugin not initialized 
root@32748c76b2ba:/src# crictl pods
POD ID              CREATED             STATE               NAME                NAMESPACE           ATTEMPT             RUNTIME

@matteopulcini6 matteopulcini6 changed the title Handle teardown failure to avoid blocking cleanup [Release/1.7] Handle teardown failure to avoid blocking cleanup Oct 5, 2024
fuweid
fuweid previously requested changes Oct 7, 2024
Copy link
Member

@fuweid fuweid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would you please squash the commits? Thanks

@matteopulcini6 matteopulcini6 force-pushed the sandbox-deferring-teardown branch from d277591 to 9fe6976 Compare October 7, 2024 22:48
@matteopulcini6
Copy link
Author

@fuweid commits are squashed

@matteopulcini6 matteopulcini6 requested a review from fuweid October 7, 2024 23:41
@dmcgowan dmcgowan added impact/changelog and removed cherry-pick/1.7.x Change to be cherry picked to release/1.7 branch labels Oct 8, 2024
@dmcgowan
Copy link
Member

dmcgowan commented Oct 8, 2024

This had the 1.7 cherry pick label, is this change already in main?

@matteopulcini6
Copy link
Author

@mikebrow @dmcgowan I didn't see the change in main. Also submitted this for release/1.6 [Release/1.6 Handle teardown failure to avoid blocking cleanup #10778

Please note: I only saw one teardownPodNetwork for 1.6 hence why the the cleanupErr is set to nil only once.

Copy link
Member

@dmcgowan dmcgowan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will deviate from main, a change needs to get merged there first

@matteopulcini6
Copy link
Author

@dmcgowan Added the PR here #10800, but fyi it looks like there is one teardown call in release/1.6, but there are two for release/1.7

@mikebrow
Copy link
Member

This will deviate from main, a change needs to get merged there first

Nod.. main first.. overlooked release in my review when I tagged it cherry pick.

@kzys kzys changed the title [Release/1.7] Handle teardown failure to avoid blocking cleanup [release/1.7] Handle teardown failure to avoid blocking cleanup Oct 18, 2024
@kzys
Copy link
Member

kzys commented Oct 18, 2024

I've merged #10778 and realized this discussion 🤦

So, since #10839 has been merged into main. Are we good to go? The commit message could be cleaned up though.

Copy link
Member

@mikebrow mikebrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with the commit header cleanup

@mikebrow
Copy link
Member

mikebrow commented Oct 22, 2024

@matteopulcini6 pls do a commit --amend and remove the extra commit header signatures and only need one 69 char description line for it .. Cheers
from:

Signed-off-by: [email protected]

Fixed Linting: gofmt

Signed-off-by: [email protected]

Handle teardown failure in user ns networking block

Signed-off-by:  [email protected]
Signed-off-by: Matteo Pulcini <[email protected]>

to

Handle teardown failure in user ns networking block

Signed-off-by: Matteo Pulcini <[email protected]>

@matteopulcini6 matteopulcini6 force-pushed the sandbox-deferring-teardown branch from 9fe6976 to 59d7eed Compare October 22, 2024 14:14
@matteopulcini6
Copy link
Author

/retest

@akhilerm
Copy link
Member

Ping @mikebrow @dmcgowan . Are we good with the changes here? The cherry-pick already got merged in 1.6 branch, but this PR is still open to 1.7. Can we get this in?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cri Container Runtime Interface (CRI) cherry-pick/1.6.x Change to be cherry picked to release/1.6 branch impact/changelog ok-to-test size/S
Projects
Status: Needs Triage
Development

Successfully merging this pull request may close these issues.

9 participants