Skip to content

Update container with sandbox metadata after NetNS is created#7481

Merged
estesp merged 1 commit intocontainerd:mainfrom
qiutongs:netns-fix
Oct 10, 2022
Merged

Update container with sandbox metadata after NetNS is created#7481
estesp merged 1 commit intocontainerd:mainfrom
qiutongs:netns-fix

Conversation

@qiutongs
Copy link
Copy Markdown
Contributor

@qiutongs qiutongs commented Oct 5, 2022

This is to fix a bug introduced in #5904.

When the network namespace is created, we need to update both container spec and sandbox metadata so that both information can persist on disk. If sandbox metadata is missing, the NetNS is considered as closed when loading sandbox during restart recovery because the NetNS.path is empty. Closed NetNS will affect network teardown in sandbox stop.

Manual Testing

  1. Add a sleep in setupPodNetwork, e.g. 60s
  2. sudo crictl runp sandbox.json
  3. Restart containerd
  4. sudo crictl pods shows the

Before fix: sudo crictl inspectp <POD ID> shows "netNamespaceClosed": true

After fix: sudo crictl inspectp <POD ID> shows "netNamespaceClosed": false

New Integration Test

ENABLE_CRI_SANDBOXES="" CONTAINERD_RUNTIME=runc FOCUS=TestRunPodSandboxWithShimStartAndTeardownCNISlow make cri-integration

@qiutongs
Copy link
Copy Markdown
Contributor Author

qiutongs commented Oct 5, 2022

/cc @fuweid @samuelkarp @dmcgowan

@k8s-ci-robot
Copy link
Copy Markdown

Hi @qiutongs. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@samuelkarp samuelkarp added the area/cri Container Runtime Interface (CRI) label Oct 6, 2022
@samuelkarp
Copy link
Copy Markdown
Member

/ok-to-test

Comment thread pkg/cri/server/sandbox_run.go Outdated
Comment thread pkg/cri/server/sandbox_run.go Outdated
@samuelkarp
Copy link
Copy Markdown
Member

samuelkarp commented Oct 6, 2022

updateContainerSpec (in pkg/cri/server/container_update_resources_other.go) is now no longer called and is causing linter errors. That will need to be removed.

@samuelkarp samuelkarp added this to the 1.7 milestone Oct 6, 2022
@qiutongs qiutongs force-pushed the netns-fix branch 3 times, most recently from f9884df to 3c3711c Compare October 6, 2022 13:47
Comment thread integration/sandbox_run_rollback_test.go Outdated
Comment thread integration/sandbox_run_rollback_test.go Outdated
Comment thread integration/sandbox_run_rollback_test.go Outdated
Comment thread integration/sandbox_run_rollback_test.go Outdated
Comment thread integration/sandbox_run_rollback_test.go Outdated
Comment thread pkg/cri/server/container_update_resources_other.go
Comment thread integration/sandbox_run_rollback_test.go Outdated
Comment thread integration/sandbox_run_rollback_test.go Outdated
Comment thread integration/sandbox_run_rollback_test.go Outdated
Comment thread integration/sandbox_run_rollback_test.go Outdated
Comment thread integration/sandbox_run_rollback_test.go Outdated

time.Sleep(5 * time.Second)
t.Log("Restart containerd after 5s")
RestartContainerd(t)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

synced with @qiutongs last night. We need to send sigkill to prevent containerd from graceful shutdown. The graceful shutdown will cancel the context to the inflight rpc call, which might invoke the defer to cleanup the resource.

@qiutongs
Copy link
Copy Markdown
Contributor Author

qiutongs commented Oct 9, 2022

[It] runtime should support portforward in host network

Timed out after 60.000s.
  Expected
      <*url.Error | 0xc0007b4ab0>: {
          Op: "Get",
          URL: "http://127.0.0.1:12002/",
          Err: <*net.OpError | 0xc0007c2190>{
              Op: "dial",
              Net: "tcp",
              Source: nil,
              Addr: <*net.TCPAddr | 0xc0007b4a80>{
                  IP: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 255, 255, 127, 0, 0, 1],
                  Port: 12002,
                  Zone: "",
              },
              Err: <*os.SyscallError | 0xc00014b4a0>{
                  Syscall: "connect",
                  Err: <syscall.Errno>0x6f,
              },
          },
      }
  to be nil
  In [It] at: github.com/kubernetes-sigs/cri-tools/pkg/validate/networking.go:253

@qiutongs
Copy link
Copy Markdown
Contributor Author

qiutongs commented Oct 9, 2022

[It] runtime should support portforward in host network

Timed out after 60.000s.
  Expected
      <*url.Error | 0xc0007b4ab0>: {
          Op: "Get",
          URL: "http://127.0.0.1:12002/",
          Err: <*net.OpError | 0xc0007c2190>{
              Op: "dial",
              Net: "tcp",
              Source: nil,
              Addr: <*net.TCPAddr | 0xc0007b4a80>{
                  IP: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 255, 255, 127, 0, 0, 1],
                  Port: 12002,
                  Zone: "",
              },
              Err: <*os.SyscallError | 0xc00014b4a0>{
                  Syscall: "connect",
                  Err: <syscall.Errno>0x6f,
              },
          },
      }
  to be nil
  In [It] at: github.com/kubernetes-sigs/cri-tools/pkg/validate/networking.go:253

@qiutongs
Copy link
Copy Markdown
Contributor Author

qiutongs commented Oct 9, 2022

/retest

@qiutongs qiutongs requested review from fuweid and samuelkarp and removed request for dmcgowan, fuweid and samuelkarp October 9, 2022 14:54
Copy link
Copy Markdown
Member

@fuweid fuweid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Member

@estesp estesp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@estesp estesp merged commit 9b0ab84 into containerd:main Oct 10, 2022
Copy link
Copy Markdown
Member

@samuelkarp samuelkarp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the late review, but LGTM. One small nit remains, but that can be addressed separately.

Comment thread script/test/utils.sh
Comment on lines +278 to +280
# The command may return non-zero and we want to continue this script.
# e.g. containerd receives SIGKILL
set +e
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bash options (which are set with set) do not have a function scope, so this set +e affects execution for all commands afterward. There's no set -e (or set -o errexit) in the script, so we're not changing behavior, but it'd be better to move this to the top of the script so it's clear that it applies to everything.

Here's a little script demonstrating:

#!/bin/bash

# This should exit non-zero, but not cause the script to exit.
false
echo "false: $?"

# Abort the script when the first error occurs
set -e
echo "set -e"

# false # This would cause the script to abort

myfunc() {
  echo "myfunc"
  # Allow errors
  set +e
  false
  echo "false: $?"
}

# call myfunc
myfunc

echo "calling false again"
false
echo "false: $?"

echo "If 'set -e' were in effect, we wouldn't reach this line."

set -e
false
echo "Unreachable"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great suggestion. Captured in #7441

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The keepalive is running in background, which is in other process. It doesn't impact the following statements. But it needs doc or uses separate bash script to ship it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/cri Container Runtime Interface (CRI) cherry-picked/sbserver Changes are backported to sbserver cherry-picked/1.6.x PR commits are cherry-picked into release/1.6 branch kind/bug ok-to-test

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants