Skip to content

CRI: Fix no CNI info for pod sandbox on restart#7845

Merged
samuelkarp merged 1 commit intocontainerd:mainfrom
dcantah:fix-noip-onrestart
Dec 20, 2022
Merged

CRI: Fix no CNI info for pod sandbox on restart#7845
samuelkarp merged 1 commit intocontainerd:mainfrom
dcantah:fix-noip-onrestart

Conversation

@dcantah
Copy link
Copy Markdown
Member

@dcantah dcantah commented Dec 20, 2022

Fixes #7843

Due to when we were updating the pod sandboxes underlying container
object, the pointer to the sandbox would have the right info, but
the on-disk representation of the data was behind. This would cause
the data returned from loading any sandboxes after a restart to have
no CNI result or IP information for the pod.

This change does an additional update to the on-disk container info
right after we invoke the CNI plugin so the metadata for the CNI result
and other networking information is properly flushed to disk.

Note:

I think RunPodSandbox is annoyingly hairy.. It's kind of scary how even the smallest refactorings in this code can cause regressions that aren't immediately noticeable, and are even harder to catch in review

@dcantah
Copy link
Copy Markdown
Member Author

dcantah commented Dec 20, 2022

cc @qiutongs, won't let me add you as a reviewer. I believe this was a side effect from #5904

ncopa added a commit to ncopa/k0s that referenced this pull request Dec 20, 2022
@dcantah dcantah marked this pull request as draft December 20, 2022 10:47
@dcantah dcantah marked this pull request as ready for review December 20, 2022 11:13
ncopa added a commit to ncopa/k0s that referenced this pull request Dec 20, 2022
@dcantah
Copy link
Copy Markdown
Member Author

dcantah commented Dec 20, 2022

Funnily enough I didn't need to change sbserver, as it was already doing a double update

@MikeZappa87
Copy link
Copy Markdown
Member

MikeZappa87 commented Dec 20, 2022

"I think RunPodSandbox is annoyingly hairy.. It's kind of scary how even the smallest refactorings in this code can cause regressions that aren't immediately noticeable and are even harder to catch in review" <--- We should tackle this. That code needs some love. Is the code in a state where its easy to create a test for this regression?

Thanks for tackling this

Comment thread pkg/cri/server/sandbox_run.go
@qiutongs
Copy link
Copy Markdown
Contributor

The fix LGTM. Sorry for the miss.

@samuelkarp
Copy link
Copy Markdown
Member

/test pull-containerd-sandboxed-node-e2e

Comment thread pkg/cri/server/sandbox_run.go Outdated
Due to when we were updating the pod sandboxes underlying container
object, the pointer to the sandbox would have the right info, but
the on-disk representation of the data was behind. This would cause
the data returned from loading any sandboxes after a restart to have
no CNI result or IP information for the pod.

This change does an additional update to the on-disk container info
right after we invoke the CNI plugin so the metadata for the CNI result
and other networking information is properly flushed to disk.

Signed-off-by: Danny Canter <[email protected]>
Copy link
Copy Markdown
Member

@mikebrow mikebrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@samuelkarp
Copy link
Copy Markdown
Member

/test pull-containerd-sandboxed-node-e2e

@dcantah
Copy link
Copy Markdown
Member Author

dcantah commented Dec 20, 2022

I'll handle the backport to 1.6 and 1.5 after merge, thanks all

@mikebrow
Copy link
Copy Markdown
Member

mikebrow commented Dec 20, 2022

I think RunPodSandbox is annoyingly hairy.. It's kind of scary how even the smallest refactorings in this code can cause regressions that aren't immediately noticeable, and are even harder to catch in review

This was not a small refactoring, IMO, and it was done async with a couple other refactoring efforts. Arguably, it was a bit less unwieldy before the sum of changes https://github.com/containerd/cri/blob/release/1.4/pkg/server/sandbox_run.go#L53-L315

Obvious statement: The pod container updates were not originally needed because the pod container was created after the network setup, there was no need to warn future programmers not to move it or if you do don't forget to update it.

More coming here to support user namespaces.. as the user namespace change will use runc create to setup the network namespace in user, giving us 3 network setup modes (host, pod, user).

Maybe open an issue for: cleaning this set of functions up, without changing behavior, a network setup state diagram so programmers can follow along, and check for a proper set of test cases covering the variations of each mode.

@dcantah
Copy link
Copy Markdown
Member Author

dcantah commented Dec 20, 2022

This was not a small refactoring, IMO, and it was done async with a couple other refactoring efforts.

Should've been more clear, wasn't referring to any of the recent changes here, more so any changes anyone could make around this setup code in general. You can move some of these Updates/state saves a couple lines up/down as shown here and have some edge case explode. It's hard to keep in your head during review or development everything that can go wrong given everything that someone could do/could happen to the daemon.

I love the idea about diagrams for net setup, perhaps these already exist somewhere but may need to be updated after the re-ordering work

@samuelkarp samuelkarp merged commit 3233d5d into containerd:main Dec 20, 2022
@mikebrow
Copy link
Copy Markdown
Member

@MikeZappa87 has been working on some network diagrams.. Yes, the totality of cross dependencies between the node components is ... well there is a lot that can go wrong.

@dmcgowan dmcgowan added cherry-picked/1.5.x PR commits are cherry-picked into release/1.5 branch cherry-picked/1.6.x PR commits are cherry-picked into release/1.6 branch and removed cherry-pick/1.5.x labels Dec 20, 2022
ncopa added a commit to ncopa/k0s that referenced this pull request Dec 21, 2022
@brandond
Copy link
Copy Markdown
Contributor

brandond commented Jan 4, 2023

Did this get backported to 1.6.x yet? I am seeing that v1.6.9 through v1.6.12 are all affected by this.

@dmcgowan
Copy link
Copy Markdown
Member

dmcgowan commented Jan 4, 2023

@brandond it will be in v1.6.15, there is no time set for it yet, but we can do it anytime if needed.

@brandond
Copy link
Copy Markdown
Contributor

Does this need to go back to 1.5.16 as well? It originally looked like we weren't seeing this on 1.5 but our QA team has turned up an issue that looks very similar.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/cri Container Runtime Interface (CRI) cherry-picked/1.5.x PR commits are cherry-picked into release/1.5 branch cherry-picked/1.6.x PR commits are cherry-picked into release/1.6 branch impact/changelog kind/bug

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

CRI: Sandbox IP not present after containerd restart

8 participants