CRI: Fix no CNI info for pod sandbox on restart#7845
CRI: Fix no CNI info for pod sandbox on restart#7845samuelkarp merged 1 commit intocontainerd:mainfrom
Conversation
containerd/containerd#7845 Signed-off-by: Natanael Copa <[email protected]>
cf6bb08 to
ac22c9f
Compare
containerd/containerd#7845 Signed-off-by: Natanael Copa <[email protected]>
|
Funnily enough I didn't need to change sbserver, as it was already doing a double update |
|
"I think RunPodSandbox is annoyingly hairy.. It's kind of scary how even the smallest refactorings in this code can cause regressions that aren't immediately noticeable and are even harder to catch in review" <--- We should tackle this. That code needs some love. Is the code in a state where its easy to create a test for this regression? Thanks for tackling this |
|
The fix LGTM. Sorry for the miss. |
|
/test pull-containerd-sandboxed-node-e2e |
Due to when we were updating the pod sandboxes underlying container object, the pointer to the sandbox would have the right info, but the on-disk representation of the data was behind. This would cause the data returned from loading any sandboxes after a restart to have no CNI result or IP information for the pod. This change does an additional update to the on-disk container info right after we invoke the CNI plugin so the metadata for the CNI result and other networking information is properly flushed to disk. Signed-off-by: Danny Canter <[email protected]>
ac22c9f to
3ee6dd5
Compare
|
/test pull-containerd-sandboxed-node-e2e |
|
I'll handle the backport to 1.6 and 1.5 after merge, thanks all |
This was not a small refactoring, IMO, and it was done async with a couple other refactoring efforts. Arguably, it was a bit less unwieldy before the sum of changes https://github.com/containerd/cri/blob/release/1.4/pkg/server/sandbox_run.go#L53-L315 Obvious statement: The pod container updates were not originally needed because the pod container was created after the network setup, there was no need to warn future programmers not to move it or if you do don't forget to update it. More coming here to support user namespaces.. as the user namespace change will use runc create to setup the network namespace in user, giving us 3 network setup modes (host, pod, user). Maybe open an issue for: cleaning this set of functions up, without changing behavior, a network setup state diagram so programmers can follow along, and check for a proper set of test cases covering the variations of each mode. |
Should've been more clear, wasn't referring to any of the recent changes here, more so any changes anyone could make around this setup code in general. You can move some of these Updates/state saves a couple lines up/down as shown here and have some edge case explode. It's hard to keep in your head during review or development everything that can go wrong given everything that someone could do/could happen to the daemon. I love the idea about diagrams for net setup, perhaps these already exist somewhere but may need to be updated after the re-ordering work |
|
@MikeZappa87 has been working on some network diagrams.. Yes, the totality of cross dependencies between the node components is ... well there is a lot that can go wrong. |
containerd/containerd#7845 Signed-off-by: Natanael Copa <[email protected]>
|
Did this get backported to 1.6.x yet? I am seeing that v1.6.9 through v1.6.12 are all affected by this. |
|
@brandond it will be in v1.6.15, there is no time set for it yet, but we can do it anytime if needed. |
containerd/containerd#7845 Signed-off-by: Natanael Copa <[email protected]>
|
Does this need to go back to 1.5.16 as well? It originally looked like we weren't seeing this on 1.5 but our QA team has turned up an issue that looks very similar. |
Fixes #7843
Due to when we were updating the pod sandboxes underlying container
object, the pointer to the sandbox would have the right info, but
the on-disk representation of the data was behind. This would cause
the data returned from loading any sandboxes after a restart to have
no CNI result or IP information for the pod.
This change does an additional update to the on-disk container info
right after we invoke the CNI plugin so the metadata for the CNI result
and other networking information is properly flushed to disk.
Note:
I think RunPodSandbox is annoyingly hairy.. It's kind of scary how even the smallest refactorings in this code can cause regressions that aren't immediately noticeable, and are even harder to catch in review