
Handle pod network teardown more carefully when setup failed. #12132

Open

MrHohn wants to merge 1 commit into containerd:main from MrHohn:improve-network-teardown

Conversation


@MrHohn MrHohn commented Jul 21, 2025

This is a follow up of both #10744 and #10839, where teardownPodNetwork may be skipped if setup networks failed in the first place.

In the situation illustrated on #12130, for chained CNI pattern the skipping logic may be problematic and leads to network resource leakage (e.g. Pod IP).

This PR attempts to make the sandbox network handling more robust by skipping the network teardown only when the error is related to CNI initialization, while allowing a teardown retry to happen if setup already partially allocated network resources.

/cc @sameersaeed @mikebrow @aojea

@k8s-ci-robot

@MrHohn: GitHub didn't allow me to request PR reviews from the following users: aojea, sameersaeed.

Note that only containerd members and repo collaborators can review this PR, and authors cannot review their own PRs.


In response to this:

This is a follow up of both #10744 and #10839, where teardownPodNetwork may be skipped if setup networks failed in the first place.

In the situation illustrated on #12130, for chained CNI pattern the skipping logic may be problematic and leads to network resource leakage (e.g. Pod IP).

This PR attempts to make the sandbox network handling more robust by conditionally ignoring the teardown network when error is related to CNI initialization, while allowing teardown retry to happen when it makes sense.

/cc @sameersaeed @mikebrow @aojea

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot

Hi @MrHohn. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.



@dosubot dosubot Bot added the area/cri Container Runtime Interface (CRI) label Jul 21, 2025
@MrHohn MrHohn force-pushed the improve-network-teardown branch from 35f4699 to 9ba9f99 Compare July 21, 2025 23:40
@sameersaeed
Contributor

Thanks @MrHohn, I tested creating a new pod and it seems to work as expected - a new pod cannot be created when no CNI plugins are initialized:

root@4db2b51bc43c:/src# crictl runp pod-config.json 
E0721 23:45:32.441792   25707 remote_runtime.go:193] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to setup network for sandbox \"a263eb38687a7c4affe7b7d863c5ca9f3997c4329538ccbe9205cfe21353447d\": cni plugin not initialized"
FATA[0000] run pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "a263eb38687a7c4affe7b7d863c5ca9f3997c4329538ccbe9205cfe21353447d": cni plugin not initialized 
root@4db2b51bc43c:/src# crictl pods
POD ID              CREATED             STATE               NAME                NAMESPACE           ATTEMPT             RUNTIME
root@4db2b51bc43c:/src#

@MrHohn
Author

MrHohn commented Jul 23, 2025

/cc @estesp

@k8s-ci-robot k8s-ci-robot requested a review from estesp July 23, 2025 17:02
@MrHohn
Author

MrHohn commented Jul 23, 2025

Adding @estesp from the previous change (#10839) - as we will need 2 approving reviews. Thanks!

@mikebrow
Member

/ok-to-test

@MrHohn
Author

MrHohn commented Jul 25, 2025

From the pull-containerd-node-e2e failures - many of them seem unrelated to the changes here.

/retest

		return fmt.Errorf("failed to destroy network for sandbox %q: %w", id, err)
	}
	if err := c.teardownPodNetwork(ctx, sandbox); err != nil {
		return fmt.Errorf("failed to destroy network for sandbox %q: %w", id, err)
Contributor


Why should you call the CNI plugin if you didn't get any result before?

Contributor


It seems that the previous check, if sandbox.CNIResult != nil {, is OK.

Author


I reverted this change because CNIResult will only be set if setupPodNetwork succeeded. In cases where both setupPodNetwork and teardownPodNetwork failed during a sandbox creation, we would end up not retrying teardownPodNetwork even though it failed previously.

Contributor


Yeah, that is why I added this comment #12132 (comment) - the setupPodNetwork error can be handled directly after the output of that function, and not in this magic defer that is so complex to understand... it also depends on two magic error variables, which makes it more confusing.

Author


There are a few more thoughts behind this - trying to merge the conversation into this thread: #12132 (comment).

Comment thread internal/cri/server/sandbox_run.go Outdated
Comment on lines 241 to 248
  if cleanupErr = c.teardownPodNetwork(deferCtx, sandbox); cleanupErr != nil {
      log.G(ctx).WithError(cleanupErr).Errorf("Failed to destroy network for sandbox %q", id)

-     // ignoring failed to destroy networks when we failed to setup networks
-     if sandbox.CNIResult == nil {
+     // Ignoring network cleanup error if the CNI config/plugin failed to be initialized.
+     // It is safe to ignore in this scenario as no network plugins should have been executed.
+     if errors.Is(cleanupErr, ErrCNIConfigNotInitialized) ||
+         errors.Is(cleanupErr, cni.ErrCNINotInitialized) {
          cleanupErr = nil
Contributor

@aojea commented Jul 31, 2025


What makes this complex to reason about is embedding all the logic in this big defer. I think it is better to compartmentalize this logic and handle the errors that come from c.setupPodNetwork directly there: if the network plugin is not initialized, you don't have to call teardownPodNetwork and can just return an error, and this remains the same as the sandbox.CNIResult == nil case.

Author


Thanks for the comment Antonio. +1 that baking more complex logic into this defer block makes it hard to reason about the behavior. I will take your advice and see how to handle the error from c.setupPodNetwork directly.

Author


(Sorry for the long comment! I was thinking out loud while amending the code.)

Coming back to this - I looked around at how TeardownPodNetwork() is triggered and haven't found an ideal way to handle this yet. I'd like to probe for more input here.

So there are two separate issues this PR is focusing on (the aim is to address both):

  1. When SetupPodNetwork fails completely (e.g. due to CNI not initialized), the sandbox deletion will be blocked by TeardownPodNetwork failing for the same reason. This deadlock was previously addressed by checking if CNIResult == nil and ignoring the cleanup error.
  2. When SetupPodNetwork fails partially in the second half of the CNI chain, TeardownPodNetwork might fail early on (CNI executes in the reverse order during teardown) and leak the network resources allocated by the first half. TeardownPodNetwork is then skipped for later retries because CNIResult == nil.

There are two places where TeardownPodNetwork is called:

  1. RunPodSandbox: When SetupPodNetwork fails, TeardownPodNetwork will be called in the defer func as the initial attempt to clean up network resources.
  2. StopPodSandbox: TeardownPodNetwork is called along with the process of stopping the sandbox.

In both places above, TeardownPodNetwork will be skipped (or its error ignored) when CNIResult == nil. This check itself seems problematic because a half-executed SetupPodNetwork could have allocated network resources while still resulting in a nil CNIResult, hence it is not always safe to skip the teardown or ignore the cleanup error under this condition. Doing so prevents further retries of TeardownPodNetwork from happening, eventually leading to permanently leaked resources (for the lifetime of the node).

One tricky part I'm seeing in the implementation is that, in RunPodSandbox, TeardownPodNetwork is conditionally called in the defer func based on local variables (retErr and cleanupErr). There is no obvious way to influence the execution of TeardownPodNetwork directly while handling the error returned by SetupPodNetwork (without major refactoring). The most straightforward way is to categorize retErr and conditionally call TeardownPodNetwork - and that's what is in this PR at the moment.

If I take a step back and think of a more intuitive way of handling this: ideally I would adjust the Setup methods in the go-cni lib to return a more accurate result. For instance, if the CNI plugins were executed only partially, the partial results should also be returned so that they can be stored by containerd. The decision on whether a cleanup is required would then be based on CNIResult for real, rather than the ad-hoc conditions we have here.

Would love to see what thoughts you might have on this :)

@MrHohn MrHohn force-pushed the improve-network-teardown branch 2 times, most recently from ed67626 to e95ba1e Compare August 4, 2025 19:27
@samuelkarp samuelkarp moved this from Needs Triage to Needs Reviewers in Pull Request Review Aug 19, 2025
@MrHohn
Author

MrHohn commented Aug 25, 2025

Bumping this up again - hoping to bring this back on track.

The aim of this fix is really to stop the bleeding while minimizing the changes needed. We are keen to resolve the IP-leaking issue and to make sure the resolution is backported to the impacted branches in a timely manner.

Yes, there are various options to "correct" the behavior (e.g. #12130 (comment)) outside of what's proposed here. I'm open to any advice and feedback on this. Thanks!

@chrishenzie
Member

Thanks for your work on this PR and patience in getting it reviewed.

To the maintainers, this PR is an important fix for the GKE team. We've had customers report IP address leaks on their nodes, and our investigation traced the root cause to the change in network teardown logic in #10744. Given the customer impact, we would appreciate it if we could get this reviewed and merged as soon as possible.

@mikebrow, @estesp, @aojea, could you please take a look when you have a moment?

@mikebrow mikebrow force-pushed the improve-network-teardown branch from e95ba1e to 0d13443 Compare January 30, 2026 18:01
Member

@mikebrow mikebrow left a comment


LGTM

@mikebrow mikebrow added cherry-pick/1.7.x Change to be cherry picked to release/1.7 branch cherry-pick/2.1.x Change to be cherry picked to release/2.1 branch cherry-pick/2.2.x Change to be cherry picked to release/2.2 branch labels Jan 30, 2026
@MrHohn
Author

MrHohn commented Feb 3, 2026

The test failure looks like flake reported here: #12260

/retest

This is a follow up of both containerd#10744 and containerd#10839, where teardownPodNetwork
may be skipped if setup networks failed in the first place.

In the situation illustrated on containerd#12130, for chained CNI pattern the
skipping logic may be problematic and can lead to network resource
leakage (e.g. Pod IP).

This PR attempts to make the sandbox network handling more robust by
conditionally skipping the teardown network only when error is related
to CNI initialization, while allowing teardown retry to happen if setup
already allocated network resources partially.

Signed-off-by: Zihong Zheng <[email protected]>
@mikebrow mikebrow force-pushed the improve-network-teardown branch from 0d13443 to 2b67a18 Compare February 3, 2026 21:20
@k8s-ci-robot

PR needs rebase.




Labels

area/cri (Container Runtime Interface (CRI))
cherry-pick/1.7.x (Change to be cherry picked to release/1.7 branch)
cherry-pick/2.1.x (Change to be cherry picked to release/2.1 branch)
cherry-pick/2.2.x (Change to be cherry picked to release/2.2 branch)
needs-rebase
ok-to-test
size/S

Projects

Status: Needs Reviewers

Development

Successfully merging this pull request may close these issues.

7 participants