Skip to content

runtime v2: Close platform in runc shim's Shutdown method.#3896

Merged
crosbymichael merged 1 commit intocontainerd:masterfrom
sipsma:close-platform-once
Dec 19, 2019
Merged

runtime v2: Close platform in runc shim's Shutdown method.#3896
crosbymichael merged 1 commit intocontainerd:masterfrom
sipsma:close-platform-once

Conversation

@sipsma
Copy link
Copy Markdown
Contributor

@sipsma sipsma commented Dec 17, 2019

Previously, the platform was closed as part of the Delete method when the
process was an init for a task and there were no more tasks after its deletion.
This can create problems if another task is created within the shim right after
the delete runs, which results in the platform being closed but the shim
continuing to run.

This change moves closing the platform to the Shutdown method after the shim's
context is canceled, which ensures the platform is only closed once the shim
is sure its done servicing containers.

Signed-off-by: Erik Sipsma [email protected]

Fixes #3895

@theopenlab-ci
Copy link
Copy Markdown

theopenlab-ci Bot commented Dec 17, 2019

Build succeeded.

@estesp
Copy link
Copy Markdown
Member

estesp commented Dec 17, 2019

Hmm, I'm sure you are definitely correcting an important problem, but looks like we will have to figure out what criu doesn't like about this flow of operations:

--- FAIL: TestCheckpointRestorePTY (0.68s)
    container_checkpoint_test.go:104: runc did not terminate successfully: criu failed: type NOTIFY errno 0 path= /run/containerd-test/io.containerd.runtime.v2.task/testing/TestCheckpointRestorePTY/criu-dump.log: unknown
--- FAIL: TestCheckpointRestore (0.25s)
    container_checkpoint_test.go:218: runc did not terminate successfully: criu failed: type NOTIFY errno 0 path= /run/containerd-test/io.containerd.runtime.v2.task/testing/TestCheckpointRestore/criu-dump.log: unknown
--- FAIL: TestCheckpointRestoreNewContainer (0.25s)
    container_checkpoint_test.go:306: runc did not terminate successfully: criu failed: type NOTIFY errno 0 path= /run/containerd-test/io.containerd.runtime.v2.task/testing/TestCheckpointRestoreNewContainer/criu-dump.log: unknown
--- FAIL: TestCheckpointLeaveRunning (0.26s)
    container_checkpoint_test.go:398: runc did not terminate successfully: criu failed: type NOTIFY errno 0 path= /run/containerd-test/io.containerd.runtime.v2.task/testing/TestCheckpointLeaveRunning/criu-dump.log: unknown
--- FAIL: TestCRWithImagePath (0.20s)
    container_checkpoint_test.go:465: runc did not terminate successfully: criu failed: type NOTIFY errno 0 path= /run/containerd-test/io.containerd.runtime.v2.task/testing/TestCRWithImagePath-checkpoint/criu-dump.log: unknown
--- FAIL: TestContainerList (0.01s)
    container_test.go:70: expected 0 containers but received 5
FAIL

@sipsma
Copy link
Copy Markdown
Contributor Author

sipsma commented Dec 17, 2019

@estesp Just to understand the tests a bit better, it looks like those same failures occurred in both the run with TEST_RUNTIME=io.containerd.runc.v1 and the run with TEST_RUNTIME=io.containerd.runc.v2. But my change is only to runtime/v2/runc/v2/service.go so I wouldn't have expected io.containerd.runc.v1 to be impacted at all. Is there something I'm missing or some dependency between the two?

@sipsma
Copy link
Copy Markdown
Contributor Author

sipsma commented Dec 17, 2019

@estesp I looked into the failure a bit more, I am able to reproduce it locally when running against containerd master w/out the changes in this PR.

I'm not sure if there's an easier way, but I reproduced by starting a go docker container:

docker volume rm -f containerd-test-varlib && docker run --rm -it --privileged --volume containerd-test-varlib:/var/lib golang:1.13

and then running the equivalent of what travis does AFAICT:

apt update && \
apt install -y btrfs-tools libnl-3-dev  libnet-dev  protobuf-c-compiler  python-minimal libcap-dev libaio-dev libprotobuf-c-dev libprotobuf-dev socat unzip libbtrfs-dev && \
mkdir -p /go/src/github.com/containerd && \
cd /go/src/github.com/containerd && \
git clone https://github.com/containerd/containerd.git && \
cd containerd && \
script/setup/install-protobuf && \
chmod +x /usr/local/bin/protoc && \
chmod og+rx /usr/local/include/google /usr/local/include/google/protobuf /usr/local/include/google/protobuf/compiler && \
chmod -R og+r /usr/local/include/google/protobuf/ && \
script/setup/install-seccomp && \
script/setup/install-runc && \
script/setup/install-cni && \
script/setup/install-critools && \
wget https://github.com/checkpoint-restore/criu/archive/v3.12.tar.gz -O /tmp/criu.tar.gz && \
tar -C /tmp/ -zxf /tmp/criu.tar.gz && \
make -C /tmp/criu-3.12 install-criu && \
make && \
make install

Then when I run the integ tests, I get the same CRIU test failures

TEST_RUNTIME=io.containerd.runc.v1 make integration | tee test.log
...
--- FAIL: TestCheckpointRestorePTY (0.67s)
    container_checkpoint_test.go:104: runc did not terminate successfully: criu failed: type NOTIFY errno 0 path= /run/containerd-test/io.containerd.runtime.v2.task/testing/TestCheckpointRestorePTY/criu-dump.log: unknown
--- FAIL: TestCheckpointRestore (0.30s)
    container_checkpoint_test.go:218: runc did not terminate successfully: criu failed: type NOTIFY errno 0 path= /run/containerd-test/io.containerd.runtime.v2.task/testing/TestCheckpointRestore/criu-dump.log: unknown
--- FAIL: TestCheckpointRestoreNewContainer (0.29s)
    container_checkpoint_test.go:306: runc did not terminate successfully: criu failed: type NOTIFY errno 0 path= /run/containerd-test/io.containerd.runtime.v2.task/testing/TestCheckpointRestoreNewContainer/criu-dump.log: unknown
--- FAIL: TestCheckpointLeaveRunning (0.31s)
    container_checkpoint_test.go:398: runc did not terminate successfully: criu failed: type NOTIFY errno 0 path= /run/containerd-test/io.containerd.runtime.v2.task/testing/TestCheckpointLeaveRunning/criu-dump.log: unknown
--- FAIL: TestCRWithImagePath (0.27s)
    container_checkpoint_test.go:465: runc did not terminate successfully: criu failed: type NOTIFY errno 0 path= /run/containerd-test/io.containerd.runtime.v2.task/testing/TestCRWithImagePath-checkpoint/criu-dump.log: unknown
--- FAIL: TestShimDoesNotLeakPipes (0.00s)
    container_linux_test.go:260: exit status 1
--- FAIL: TestContainerList (0.01s)
    container_test.go:70: expected 0 containers but received 5

There are a few extra errors (perhaps because I didn't replicate the travis env perfectly or container-in-container issues), but those CRIU errors are consistent and look to be the same as what this PR is getting.

It's very strange though because I can see recent builds of containerd's master branch all succeeded... Any ideas? Is it possible it's the failures are unrelated to the code in this PR?

@sipsma sipsma force-pushed the close-platform-once branch from 33ad230 to 51f941e Compare December 17, 2019 23:12
@theopenlab-ci
Copy link
Copy Markdown

theopenlab-ci Bot commented Dec 17, 2019

Build succeeded.

@sipsma
Copy link
Copy Markdown
Contributor Author

sipsma commented Dec 18, 2019

Tried an empty force-push to re-run the tests, but got the exact same results.

One other thing I just noticed that may or may not be relevant, the most recent build of containerd's master that succeeded was using a travis worker with kernel version 4.15.0-1040-gcp. However, the runs here are using kernel version 5.0.0-1026-gcp in travis (seems like the travis workers may have been updated?)

When I reproduced the criu test failures locally on master branch, it was on a machine with kernel 5.0.0-1022-aws.

Is it possible for the containerd maintainers to retry builds of master in travis to see if a travis worker kernel upgrade could be interfering with criu tests?

@estesp
Copy link
Copy Markdown
Member

estesp commented Dec 18, 2019

@sipsma you are correct; it is a timing issue with you happening to catch a Travis kernel upgrade that no other PR or branch run has been on! A draft PR I just created hit the issue without any code changes. Now testing against a later criu build in case there are fixes for a 5.0 kernel issue.

Previously, the platform was closed as part of the Delete method when the
process was an init for a task and there were no more tasks after its deletion.
This can create problems if another task is created within the shim right after
the delete runs, which results in the platform being closed but the shim
continuing to run.

This change moves closing the platform to the Shutdown method after the shim's
context is canceled, which ensures the platform is only closed once the shim
is sure its done servicing containers.

Signed-off-by: Erik Sipsma <[email protected]>
@estesp estesp force-pushed the close-platform-once branch from 51f941e to fbd46d7 Compare December 19, 2019 14:47
@theopenlab-ci
Copy link
Copy Markdown

theopenlab-ci Bot commented Dec 19, 2019

Build succeeded.

@codecov-io
Copy link
Copy Markdown

Codecov Report

Merging #3896 into master will increase coverage by 4.46%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3896      +/-   ##
==========================================
+ Coverage      38%   42.47%   +4.46%     
==========================================
  Files          88      130      +42     
  Lines       12326    14716    +2390     
==========================================
+ Hits         4685     6251    +1566     
- Misses       6994     7544     +550     
- Partials      647      921     +274
Flag Coverage Δ
#linux 45.86% <ø> (?)
#windows 38% <ø> (ø) ⬆️
Impacted Files Coverage Δ
mount/temp_unix.go 0% <0%> (ø)
sys/reaper_linux.go 0% <0%> (ø)
services/server/server_linux.go 0% <0%> (ø)
sys/env.go 100% <0%> (ø)
snapshots/devmapper/snapshotter.go 66.39% <0%> (ø)
sys/stat_unix.go 0% <0%> (ø)
runtime/v2/shim/shim_unix.go 0% <0%> (ø)
sys/reaper.go 0% <0%> (ø)
oci/spec_opts_linux.go 0% <0%> (ø)
sys/fds.go 0% <0%> (ø)
... and 41 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 05bc0a1...fbd46d7. Read the comment docs.

Copy link
Copy Markdown
Member

@estesp estesp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@estesp estesp requested a review from crosbymichael December 19, 2019 15:35
@crosbymichael
Copy link
Copy Markdown
Member

LGTM

@crosbymichael crosbymichael merged commit 1809231 into containerd:master Dec 19, 2019
@estesp
Copy link
Copy Markdown
Member

estesp commented Dec 20, 2019

Cherry picked to release/1.3 in #3907

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

runtime: runc v2 shim closes platform prematurely and multiple times

4 participants