Handle pod network teardown more carefully when setup failed. #12132
MrHohn wants to merge 1 commit into containerd:main
Conversation
@MrHohn: GitHub didn't allow me to request PR reviews from the following users: aojea, sameersaeed. Note that only containerd members and repo collaborators can review this PR, and authors cannot review their own PRs.
Hi @MrHohn. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here.
Force-pushed 35f4699 to 9ba9f99
Thanks @MrHohn, I tested creating a new pod and it seems to work as expected - a new pod cannot be created when no CNI plugins are initialized.
/cc @estesp
/ok-to-test
Looking at the pull-containerd-node-e2e failures - many of them seem unrelated to the changes here.

/retest
Force-pushed 9ba9f99 to 1c77f9a
```go
		return fmt.Errorf("failed to destroy network for sandbox %q: %w", id, err)
	}
	if err := c.teardownPodNetwork(ctx, sandbox); err != nil {
		return fmt.Errorf("failed to destroy network for sandbox %q: %w", id, err)
```
Why should you call the CNI plugin if you didn't get any result before?
It seems that the previous `if sandbox.CNIResult != nil {` check is OK.
I reverted this change because `CNIResult` will only be set if `setupPodNetwork` succeeded. In cases where both `setupPodNetwork` and `teardownPodNetwork` failed during sandbox creation, we end up never retrying `teardownPodNetwork` despite its earlier failure.
Yeah, that is why I added this comment: #12132 (comment). The `setupPodNetwork` error can be handled directly after that function returns, rather than in this magic defer that is so complex to understand... it also depends on two magic error variables, which makes it even more confusing.
There are a few more thoughts behind this - trying to merge the conversation into this thread: #12132 (comment).
```diff
 	if cleanupErr = c.teardownPodNetwork(deferCtx, sandbox); cleanupErr != nil {
 		log.G(ctx).WithError(cleanupErr).Errorf("Failed to destroy network for sandbox %q", id)
 	}
-	// ignoring failed to destroy networks when we failed to setup networks
-	if sandbox.CNIResult == nil {
+	// Ignoring network cleanup error if the CNI config/plugin failed to be initialized.
+	// It is safe to ignore in this scenario as no network plugins should have been executed.
+	if errors.Is(cleanupErr, ErrCNIConfigNotInitialized) ||
+		errors.Is(cleanupErr, cni.ErrCNINotInitialized) {
 		cleanupErr = nil
```
What makes this complex to reason about is embedding all the logic in this big defer. I think it is better to compartmentalize this logic and handle the errors coming from `c.setupPodNetwork` directly there: if the network plugin is not initialized, you don't have to call `teardownPodNetwork` and can just return an error, and the `if sandbox.CNIResult == nil` check remains the same.
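A minimal sketch of that compartmentalized handling, for illustration only - the surrounding function body, the `sandbox`/`id` variables, and the call shapes are assumed context taken from the diff above, not the actual containerd code:

```go
// Sketch: handle the setup error at the call site instead of in the defer.
if err := c.setupPodNetwork(ctx, &sandbox); err != nil {
	// If the CNI config/plugin was never initialized, no plugin ran,
	// so there is nothing to tear down - just surface the error.
	if errors.Is(err, ErrCNIConfigNotInitialized) ||
		errors.Is(err, cni.ErrCNINotInitialized) {
		return nil, fmt.Errorf("failed to setup network for sandbox %q: %w", id, err)
	}
	// Otherwise the chain may have partially executed; attempt teardown
	// now, and leave any teardown failure to be retried by StopPodSandbox.
	if cleanupErr := c.teardownPodNetwork(ctx, sandbox); cleanupErr != nil {
		log.G(ctx).WithError(cleanupErr).Errorf("Failed to destroy network for sandbox %q", id)
	}
	return nil, fmt.Errorf("failed to setup network for sandbox %q: %w", id, err)
}
```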
Thanks for the comment, Antonio. +1 that baking more complex logic into this defer block makes it hard to reason about the behavior. I will take your advice and see how to handle the error from `c.setupPodNetwork` directly.
(Sorry for the long comment! I was thinking out loud while amending the code.)

Coming back to this - I looked around at how `TeardownPodNetwork()` is triggered and haven't found an ideal way to handle this yet. I'd like to gather more input here.
So there are two separate issues this PR is focusing on (the aim is to address both; a toy model of the second issue follows this list):

- When `SetupPodNetwork` fails completely (e.g. due to CNI not being initialized), sandbox deletion will be blocked by `TeardownPodNetwork` failing for the same reason. This deadlock was previously addressed by checking `if CNIResult == nil` and ignoring the cleanup error.
- When `SetupPodNetwork` fails partially in the second half of the CNI chain, `TeardownPodNetwork` might fail early on (CNI executes in reverse order during teardown) and leak the network resources allocated by the first half. `TeardownPodNetwork` is then skipped for later retries because `CNIResult == nil`.
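To make the second issue concrete, here is a self-contained toy model (not containerd or go-cni code; all names are invented) of a two-plugin chain where ADD runs in order and DEL runs in reverse - a partial ADD followed by an early DEL failure strands the IP allocated by the first plugin:

```go
package main

import (
	"errors"
	"fmt"
)

// plugin is a toy stand-in for one entry in a chained CNI config.
type plugin struct {
	name string
	add  func() error
	del  func() error
}

// setup mimics CNI ADD: plugins run in order, stopping at the first failure.
func setup(chain []plugin) error {
	for _, p := range chain {
		if err := p.add(); err != nil {
			return fmt.Errorf("ADD %s: %w", p.name, err)
		}
	}
	return nil
}

// teardown mimics CNI DEL: plugins run in reverse order, also stopping
// at the first failure, which leaves earlier plugins' resources leaked.
func teardown(chain []plugin) error {
	for i := len(chain) - 1; i >= 0; i-- {
		if err := chain[i].del(); err != nil {
			return fmt.Errorf("DEL %s: %w", chain[i].name, err)
		}
	}
	return nil
}

func main() {
	allocated := false
	chain := []plugin{
		{
			name: "ipam", // first plugin: allocates the pod IP
			add:  func() error { allocated = true; return nil },
			del:  func() error { allocated = false; return nil },
		},
		{
			name: "portmap", // second plugin: fails during both ADD and DEL
			add:  func() error { return errors.New("plugin failure") },
			del:  func() error { return errors.New("plugin failure") },
		},
	}
	fmt.Println("setup:", setup(chain))       // fails at portmap, after ipam ran
	fmt.Println("teardown:", teardown(chain)) // fails before reaching ipam
	fmt.Println("IP still allocated:", allocated)
	// If further teardown retries are skipped because no CNI result was
	// recorded, the IP above stays leaked for the node's lifetime.
}
```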
There are two places where `TeardownPodNetwork` is called:

- `RunPodSandbox`: when `SetupPodNetwork` fails, `TeardownPodNetwork` is called in the defer func as the initial attempt to clean up network resources.
- `StopPodSandbox`: `TeardownPodNetwork` is called as part of the process of stopping the sandbox.
In both of the above places, `TeardownPodNetwork` will be skipped (or its error ignored) when `CNIResult == nil`. This check itself seems problematic, because a half-executed `SetupPodNetwork` could have allocated network resources while still resulting in a nil `CNIResult`, hence it is not always safe to skip the cleanup or ignore its error under this condition. Doing so prevents further retries of `TeardownPodNetwork`, eventually leading to permanently leaked resources (for the lifetime of the node).
One tricky part I'm seeing in the implementation: in `RunPodSandbox`, `TeardownPodNetwork` is conditionally called in the defer func based on local variables (`retErr` and `cleanupErr`). There is no obvious way to influence the execution of `TeardownPodNetwork` directly while handling the error returned by `SetupPodNetwork` (without major refactoring). The most straightforward way is to categorize `retErr` and conditionally call `TeardownPodNetwork` - and that's what this PR does at the moment.
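For readers unfamiliar with this code path, a condensed, simplified sketch of the defer pattern being described (the real `RunPodSandbox` does considerably more, and `sandbox`/`id` are assumed to be in scope):

```go
func (c *criService) RunPodSandbox(ctx context.Context, r *runtime.RunPodSandboxRequest) (_ *runtime.RunPodSandboxResponse, retErr error) {
	defer func() {
		if retErr == nil {
			return // success: nothing to clean up
		}
		// Only retErr and cleanupErr are visible here, so any decision about
		// skipping teardown or ignoring its failure has to be encoded as
		// checks on those two error values inside this defer.
		if cleanupErr := c.teardownPodNetwork(ctx, sandbox); cleanupErr != nil {
			log.G(ctx).WithError(cleanupErr).Errorf("Failed to destroy network for sandbox %q", id)
		}
	}()
	// ... create the sandbox, call c.setupPodNetwork, etc. ...
	return nil, nil
}
```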
If I take a step back and think of a more intuitive way of handling this: ideally I would adjust the Setup methods in the go-cni lib to return more accurate results. For instance, if the CNI plugins were executed only partially, the partial results should still be returned so that they can be stored by containerd. The decision on whether a cleanup is required should then be based on `CNIResult` for real, rather than on the ad-hoc conditions we have here.
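A hypothetical sketch of that direction - note that the real go-cni `Setup` returns a nil result on error, so returning partial results alongside the error is the proposed change, not current behavior, and `netPlugin`/`netnsPath` are assumed names:

```go
// Hypothetical: Setup returns whatever results the chain produced before
// failing, together with the error, instead of discarding them.
result, err := netPlugin.Setup(ctx, id, netnsPath, opts...)
if result != nil {
	// Store partial results too, so CNIResult reflects what was actually
	// allocated and teardown decisions can be based on it for real.
	sandbox.CNIResult = result
}
if err != nil {
	return nil, fmt.Errorf("failed to setup network for sandbox %q: %w", id, err)
}
```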
Would love to see what thoughts you might have on this :)
Force-pushed ed67626 to e95ba1e
Bumping this up again - hoping to bring this back on track. The aim of this fix is to stop the bleeding while minimizing the changes needed. We are keen to resolve the IP leaking issue and to make sure the resolution is backported to the impacted branches in a timely manner. Yes, there are various options to "correct" the behavior (e.g. #12130 (comment)) outside of what's proposed here. I'm open to any advice and feedback on this. Thanks!
Thanks for your work on this PR and patience in getting it reviewed. To the maintainers, this PR is an important fix for the GKE team. We've had customers report IP address leaks on their nodes, and our investigation traced the root cause to the change in network teardown logic in #10744. Given the customer impact, we would appreciate it if we could get this reviewed and merged as soon as possible. @mikebrow, @estesp, @aojea, could you please take a look when you have a moment?
Force-pushed e95ba1e to 0d13443
The test failure looks like the flake reported here: #12260

/retest
This is a follow up of both containerd#10744 and containerd#10839, where `teardownPodNetwork` may be skipped if setting up networks failed in the first place. In the situation illustrated in containerd#12130, for the chained CNI pattern the skipping logic may be problematic and can lead to network resource leakage (e.g. Pod IP). This PR attempts to make the sandbox network handling more robust by skipping the network teardown only when the error is related to CNI initialization, while allowing a teardown retry to happen if setup already partially allocated network resources.

Signed-off-by: Zihong Zheng <[email protected]>
Force-pushed 0d13443 to 2b67a18
PR needs rebase.
This is a follow up of both #10744 and #10839, where `teardownPodNetwork` may be skipped if setting up networks failed in the first place. In the situation illustrated in #12130, for the chained CNI pattern the skipping logic may be problematic and leads to network resource leakage (e.g. Pod IP).

This PR attempts to make the sandbox network handling more robust by skipping the network teardown only when the error is related to CNI initialization, while allowing a teardown retry to happen if setup already partially allocated network resources.
/cc @sameersaeed @mikebrow @aojea