
Conversation

@qiutongs
Contributor

@qiutongs qiutongs commented Aug 23, 2021

Signed-off-by: Qiutong Song [email protected]

Overview

Fix issue #5768

Change the order to:

  1. Create network namespace
  2. Create sandbox container
  3. Setup pod network
  4. Create sandbox container task
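For orientation, here is a minimal Go sketch of the reordered flow. The helper names are placeholders for illustration only, not the actual functions in pkg/cri/server/sandbox_run.go:

package main

import "fmt"

// Placeholder types and helpers, for illustration only; the real logic lives
// in pkg/cri/server/sandbox_run.go.
type sandbox struct{ id, netnsPath string }

func createNetNS(id string) (string, error) { return "/var/run/netns/cni-" + id, nil }

func createSandboxContainer(id, netns string) (*sandbox, error) { return &sandbox{id, netns}, nil }

func setupPodNetwork(sb *sandbox) error { return nil } // CNI ADD against sb.netnsPath

func createSandboxTask(sb *sandbox) error { return nil } // create/start the pause task

func runPodSandbox(id string) error {
	// 1. Create the network namespace so it exists before CNI runs.
	netns, err := createNetNS(id)
	if err != nil {
		return fmt.Errorf("create netns: %w", err)
	}
	// 2. Create the sandbox container (metadata only; nothing is running yet).
	sb, err := createSandboxContainer(id, netns)
	if err != nil {
		return fmt.Errorf("create sandbox container: %w", err)
	}
	// 3. Set up the pod network. If CNI fails here, no task has been started,
	//    so the failure is unwound before anything hard to clean up exists.
	if err := setupPodNetwork(sb); err != nil {
		return fmt.Errorf("setup pod network: %w", err)
	}
	// 4. Only now create and start the sandbox (pause) container task.
	return createSandboxTask(sb)
}

func main() { fmt.Println(runPodSandbox("example")) }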

Testing

make
make test
sudo make root-test
sudo make integration

Local Verification

Simulate IP Leakage with Chained Plugin

I can simulate the IP leakage case by modifying the CNI config /etc/cni/net.d/10-containerd-net.conflist.

Add a dummy plugin that doesn't exist to the plugins chain:

{
      "type": "dummy"
}
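For reference, the full conflist then looks roughly like this (a sketch based on containerd's example config; subnets and plugin fields will differ per setup). Because a chained config runs plugins in order, the bridge plugin allocates an IP through host-local before the nonexistent dummy plugin fails, which is exactly what leaks:

{
  "cniVersion": "1.0.0",
  "name": "containerd-net",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "cni0",
      "isGateway": true,
      "ipMasq": true,
      "ipam": {
        "type": "host-local",
        "ranges": [[{ "subnet": "10.88.0.0/16" }]],
        "routes": [{ "dst": "0.0.0.0/0" }]
      }
    },
    {
      "type": "portmap",
      "capabilities": { "portMappings": true }
    },
    {
      "type": "dummy"
    }
  ]
}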

Thus, RunPodSandbox fails with the error:

ERRO[2022-04-06T08:02:59.236331342-07:00] RunPodSandbox for &PodSandboxMetadata{Name:nginx-pod,Uid:d52d99d3-b4d7-4715-a0fa-4aaca8218de1,Namespace:default,Attempt:0,} failed, error  error="failed to setup network for sandbox \"949d785ccf3d121f65407690ee2adf80d2b24f788f97cc688ed5ab4f049b7309\": failed to find plugin \"dummy\" in path [/opt/cni/bin]"
ERRO[2022-04-06T08:02:59.263476902-07:00] StopPodSandbox for "949d785ccf3d121f65407690ee2adf80d2b24f788f97cc688ed5ab4f049b7309" failed  error="failed to destroy network for sandbox \"949d785ccf3d121f65407690ee2adf80d2b24f788f97cc688ed5ab4f049b7309\": failed to find plugin \"dummy\" in path [/opt/cni/bin]"

Before the Fix

Kubelet keeps calling RunPodSandbox, and each attempt reserves a new IP. None of these IPs are released. This can be observed with the command below.

sudo ls /var/lib/cni/networks/containerd-net

After the Fix

Kubelet keeps calling StopPodSandbox after the initial RunPodSandbox. Even though the StopPodSandbox calls also fail, the IP is not leaked (observed by watching /var/lib/cni/networks/containerd-net as above; host-local ipam keeps one file per allocated IP in that directory).

Pod event

Normal   SandboxChanged          12m (x16 over 15m)  kubelet            Pod sandbox changed, it will be killed and re-created.

New Integration Test with Failpoint

# Need CNI 1.0+
$ script/setup/install-failpoint-binaries
$ ENABLE_CRI_SANDBOXES="" CONTAINERD_RUNTIME=runc FOCUS=TestRunPodSandboxWithShimStartAndTealdownCNIFailure make cri-integration
$ ENABLE_CRI_SANDBOXES="" CONTAINERD_RUNTIME=runc FOCUS=TestRunPodSandboxWithShimDeleteFailure make cri-integration

Note: run sudo rm -rf /var/lib/containerd-test to clean up state if the test fails; otherwise, subsequent runs will fail with "failed to reserve sandbox name".

New Integration Test

$ FOCUS=TestNetworkTeardownFailureWithoutIPLeakage make cri-integration
INFO[0000] Using the following image list: {Alpine:docker.io/library/alpine:latest BusyBox:docker.io/library/busybox:latest Pause:k8s.gcr.io/pause:3.6 ResourceConsumer:k8s.gcr.io/e2e-test-images/resource-consumer:1.10 VolumeCopyUp:ghcr.io/containerd/volume-copy-up:2.1 VolumeOwnership:ghcr.io/containerd/volume-ownership:2.1} 
=== RUN   TestNetworkTeardownFailureWithoutIPLeakage
    sandbox_clean_remove_test.go:136: Make sure host-local ipam and portmap are in use
    sandbox_clean_remove_test.go:152: Rename the portmap binary
    sandbox_clean_remove_test.go:157: Count the numbers of IPs allocated currently
    sandbox_clean_remove_test.go:163: Create a sandbox
E0507 19:38:18.844400 1848990 remote_runtime.go:135] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to setup network for sandbox "accdc40e1fbdc2d305634485c0246d0130ed016a2d53cbb4353803b0d2eccc73": plugin type="portmap" failed (add): failed to find plugin "portmap" in path [/opt/cni/bin]
    sandbox_clean_remove_test.go:189: Get pod information
    sandbox_clean_remove_test.go:206: [{pid } {ipc } {uts } {mount } {network /var/run/netns/cni-8171a6cd-01bc-7e96-55d6-a5fd1e953717}]
    sandbox_clean_remove_test.go:215: Count the numbers of IPs allocated after teardown failure
    sandbox_clean_remove_test.go:224: Should be able to stop and remove the sandbox
    sandbox_clean_remove_test.go:228: Count the numbers of IPs allocated after removing the sandbox
--- PASS: TestNetworkTeardownFailureWithoutIPLeakage (2.21s)
PASS

Before the fix, the test fails:

    sandbox_clean_remove_test.go:198: 
                Error Trace:    sandbox_clean_remove_test.go:198
                Error:          Should NOT be empty, but was 
                Test:           TestNetworkTeardownFailureWithoutIPLeakage
                Messages:       sandboxId should be found

@k8s-ci-robot

Hi @qiutongs. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@qiutongs
Contributor Author

/assign @Random-Liu

@theopenlab-ci

theopenlab-ci bot commented Aug 23, 2021

Build succeeded.

@Random-Liu
Member

Random-Liu commented Sep 9, 2021

With this change, even though the pod network setup and teardown are moved down, our problem is still not resolved:

  1. The sandbox container metadata will still be deleted from the containerd metadata store in the defer;
  2. The sandbox is not added into the sandbox store, so will not be listed by kubelet, thus no cleanup.

We can probably add the sandbox into the sandbox store in unknown state if there is any sandbox startup problem (remove the defer cleanup), and rely on kubelet to stop and remove the sandbox afterwards. A similar approach can be used for the container creation as well.

The assumption is that sandbox stop should be idempotent.

One thing that could be missing with the approach above is selinux.ReleaseLabel(sandbox.ProcessLabel). I'm not sure why it is not included in sandbox stop today.

@mikebrow mikebrow added the status/needs-update Awaiting contributor update label Oct 11, 2021
@mikebrow
Member

@qiutongs note @Random-Liu's request for changes

@estesp
Member

estesp commented Oct 28, 2021

any update here based on the feedback @qiutongs?

@qiutongs
Contributor Author

qiutongs commented Apr 4, 2022

Sorry, I paused this. Now I am prioritizing the fix. Will update asap.

@mikebrow
Member

mikebrow commented Apr 4, 2022

Sorry, I paused this. Now I am prioritizing the fix. Will update asap.

That works. I was poking around at this at the end of last week.. for performance reasons I was thinking we could run the pause container and CNI setup in parallel

wip commit over here:
mikebrow@17cb9da

they both take about the same amount of time.. thus running them in parallel will essentially make the pause container just a resource hit. Still have to resolve the idempotency issue.

@qiutongs qiutongs closed this Apr 5, 2022
@qiutongs qiutongs reopened this Apr 5, 2022
@theopenlab-ci

theopenlab-ci bot commented Apr 5, 2022

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.

@kevpar
Member

kevpar commented Apr 5, 2022

I think this will break VM-isolated containers on Windows. We rely on the CNI running before the pause container is started, since CNI creates the HCN endpoint, and we look for the HCN endpoint when creating the UVM, which is done as part of the pause container creation.

@kevpar
Member

kevpar commented Apr 5, 2022

@dcantah can you take a look to confirm if this will affect the vm-isolated container work you are doing?

@mikebrow
Member

mikebrow commented Apr 5, 2022

I think this will break VM-isolated containers on Windows. We rely on the CNI running before the pause container is started, since CNI creates the HCN endpoint, and we look for the HCN endpoint when creating the UVM, which is done as part of the pause container creation.

would you be ok with CNI setup and the create/start of the pause container running in parallel, if you were to add a wait check for the endpoint to exist?

@dcantah
Member

dcantah commented Apr 6, 2022

would you be ok with CNI setup and the create/start of the pause container running in parallel, if you were to add a wait check for the endpoint to exist?

@mikebrow Are you saying that if we made CNI and the pause container operations start at the same time, it would be alright on our end to just busy loop until an endpoint shows up (e.g. until CNI finishes)? I guess I'm asking whether you mean that in our shim we'd be fine looping for the existence of an ep
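For illustration, a bounded wait (rather than an unbounded busy loop) could look like the sketch below; endpointExists is a hypothetical stand-in for a real hcsshim/HCN lookup, not an actual API:

package shim

import (
	"context"
	"fmt"
	"time"
)

// endpointExists is a hypothetical placeholder for an actual HCN endpoint
// query; it is not a real hcsshim API.
func endpointExists(name string) bool { return false }

// waitForEndpoint polls until the endpoint shows up or the context expires,
// so the shim never spins forever if CNI setup never completes.
func waitForEndpoint(ctx context.Context, name string, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if endpointExists(name) {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("waiting for endpoint %q: %w", name, ctx.Err())
		case <-ticker.C:
		}
	}
}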

@qiutongs qiutongs closed this Apr 6, 2022
@qiutongs qiutongs reopened this Apr 6, 2022
@qiutongs
Contributor Author

qiutongs commented Apr 6, 2022

With this change, even though the pod network setup and teardown are moved down, our problem is still not resolved:

  1. The sandbox container metadata will still be deleted from the containerd metadata store in the defer;
  2. The sandbox is not added into the sandbox store, so will not be listed by kubelet, thus no cleanup.

We can probably add the sandbox into the sandbox store in unknown state if there is any sandbox startup problem (remove the defer cleanup), and rely on kubelet to stop and remove the sandbox afterwards. A similar approach can be used for the container creation as well.

The assumption is that sandbox stop should be idempotent.

One thing that could be missing with the approach above is selinux.ReleaseLabel(sandbox.ProcessLabel). I'm not sure why it is not included in sandbox stop today.

I see what I missed. I updated my change based on your idea.
Working on reproducing this locally.

@theopenlab-ci

theopenlab-ci bot commented Apr 6, 2022

Build succeeded.

@qiutongs
Contributor Author

qiutongs commented Apr 6, 2022

I have successfully simulated the IP leakage case and verified that this fix resolves it. See #5904 (comment)

@mikebrow
Member

mikebrow commented Apr 6, 2022

would you be ok with CNI setup and the create/start of the pause container running in parallel, if you were to add a wait check for the endpoint to exist?

@mikebrow Are you saying that if we made CNI and the pause container operations start at the same time, it would be alright on our end to just busy loop until an endpoint shows up (e.g. until CNI finishes)? I guess I'm asking whether you mean that in our shim we'd be fine looping for the existence of an ep

well yes.. not for a long period of time.. creating a pod is down to about 250ms on my old bare metal server assuming we are not resource constrained; moving to parallel will make that about 180ms, again assuming we are not resource constrained. A few fixes that went in brought pod start down from a couple of seconds for our default containerd/CNI setup/install. We could alternatively just serialize for runtimes that have a dependency on the network setup completing before the pause container is created/started..
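A sketch of that parallelization idea (using golang.org/x/sync/errgroup; setupPodNetwork and startPauseContainer are placeholders for the real steps in the wip commit above):

package sandbox

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// Placeholders for the real CNI setup and pause-container start.
func setupPodNetwork(ctx context.Context) error     { return nil }
func startPauseContainer(ctx context.Context) error { return nil }

// runParallel kicks off CNI setup and the pause container concurrently; if
// either fails, the shared context is canceled and the first error returns.
func runParallel(ctx context.Context) error {
	g, ctx := errgroup.WithContext(ctx)
	g.Go(func() error { return setupPodNetwork(ctx) })
	g.Go(func() error { return startPauseContainer(ctx) })
	return g.Wait()
}

Runtimes that need the network up before the pause container starts (the Windows UVM case above) would keep the two steps serialized, or gate the container start on an endpoint wait.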

@mikebrow
Member

mikebrow commented Apr 6, 2022

With this change, even though the pod network setup and teardown are moved down, our problem is still not resolved:

  1. The sandbox container metadata will still be deleted from the containerd metadata store in the defer;
  2. The sandbox is not added into the sandbox store, so will not be listed by kubelet, thus no cleanup.

We can probably add the sandbox into the sandbox store in unknown state if there is any sandbox startup problem (remove the defer cleanup), and rely on kubelet to stop and remove the sandbox afterwards. A similar approach can be used for the container creation as well.
The assumption is that sandbox stop should be idempotent.
One thing that could be missing with the approach above is selinux.ReleaseLabel(sandbox.ProcessLabel). I'm not sure why it is not included in sandbox stop today.

I see what I missed. I updated my change based on your idea. Working on reproducing this locally.

I think Lantao was pushing for an approach entirely different from moving the network setup down.. I think he's saying set the pod status to unknown.. that will/should cause the pod to be destroyed by kubelet with a stop and remove. Thus we should be able to remove most of the defer logic.

@dcantah
Member

dcantah commented Apr 6, 2022

We could alternatively just serialize for runtimes that have a dependency on the network setup completing before the pause container is created/started..

This doesn't seem like a bad idea either, but I'm not sure the approach currently in the PR is the final one.. With what's in the diff now, at least, I don't think this would break Windows. As long as the network is up and running before we start the task, we should be alright.

@qiutongs
Contributor Author

qiutongs commented Apr 8, 2022

I think Lantao was pushing for an approach entirely different from moving the network setup down.. I think he's saying set the pod status to unknown.. that will/should cause the pod to be destroyed by kubelet with a stop and remove. Thus we should be able to remove most of the defer logic.

I synced with Lantao offline. He agreed that we could make a minimal change now that still attempts cleanup. But if the cleanup fails, we rely on kubelet to call StopPodSandbox to clean up.

In my change, the sandbox object remains in the "unknown" state and is added to the store.

sandboxstore.Status{
	State: sandboxstore.StateUnknown,
},
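Expanded into a self-contained sketch (stand-in types, for illustration only; the real types live in pkg/cri/store/sandbox), the error path looks roughly like:

package cri

// Minimal stand-ins for containerd's sandbox store types.
type State int

const (
	StateReady State = iota
	StateNotReady
	StateUnknown
)

type Status struct{ State State }

type Sandbox struct {
	ID     string
	Status Status
}

type Store struct{ sandboxes map[string]Sandbox }

func NewStore() *Store { return &Store{sandboxes: map[string]Sandbox{}} }

func (s *Store) Add(sb Sandbox) { s.sandboxes[sb.ID] = sb }

// On a startup failure, keep the sandbox in the store in an unknown state
// instead of deleting it in a deferred cleanup. Kubelet will then list it and
// call StopPodSandbox/RemovePodSandbox, which performs the idempotent teardown.
func recordFailedStartup(store *Store, sb Sandbox) {
	sb.Status = Status{State: StateUnknown}
	store.Add(sb)
}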

@samuelkarp
Member

@qiutongs Would you mind fixing DCO on the commit here so the rest of the CI checks can run?

Member

@mikebrow mikebrow left a comment


LGTM (late) with the listed follow-ups.. I think progress is made.

@samuelkarp samuelkarp added cherry-picked/1.6.x PR commits are cherry-picked into release/1.6 branch and removed cherry-pick/1.6.x labels Oct 4, 2022
@estesp estesp added cherry-picked/1.5.x PR commits are cherry-picked into release/1.5 branch and removed cherry-pick/1.5.x labels Oct 28, 2022
@samuelkarp samuelkarp added the cherry-picked/sbserver Changes are backported to sbserver label Nov 23, 2022
