
test: introduce failpoint control to runc-shimv2 and cni #7069

Merged
dmcgowan merged 9 commits into containerd:main from fuweid:failpoint-in-runc-shimv2 on Jul 27, 2022


@fuweid fuweid commented Jun 16, 2022

For the CRI plugin, rollback corner cases are hard to test. For instance,
shim.Start can time out when the node is under high load, and high load
also affects the shim.Delete/Kill APIs. However, high load is hard to
emulate.

And even though it is easy to introduce changes like #5904 and #7044,
developers currently verify such changes manually. This is painful and can
easily cause regressions.

Based on that, I would like to introduce failpoint control to the runc shimv2
and the CNI plugin when running CI tests.

Both the runc shimv2 and the CNI plugin are external plugins in the containerd
ecosystem. We can add failpoints in a ttrpc interceptor and a CNI plugin
wrapper instead of modifying the plugins' source code. The failpoint
control is based on the FreeBSD failpoint design, which allows us to enable
a failpoint with a text description. It is human-readable and easy to maintain.

Signed-off-by: Wei Fu [email protected]
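To make the text description concrete, here is a minimal Go sketch of how a spec such as `1*error(sorry)` (the form used in the `ctr` example in this PR) could be parsed and evaluated. The names `term`, `parseTerm`, and `failpoint` are hypothetical and chosen for illustration; this is not the API of `pkg/failpoint`, and only the `<count>*error(<message>)` form is handled.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
	"sync"
)

// term is a hypothetical parsed form of "<count>*error(<message>)".
type term struct {
	remaining int    // how many more calls should fail
	msg       string // error message to return while failing
}

// parseTerm accepts only the "<count>*error(<message>)" form.
func parseTerm(spec string) (*term, error) {
	countStr, action, ok := strings.Cut(spec, "*")
	if !ok {
		return nil, fmt.Errorf("missing '*' in %q", spec)
	}
	count, err := strconv.Atoi(countStr)
	if err != nil || count < 0 {
		return nil, fmt.Errorf("invalid count in %q", spec)
	}
	if !strings.HasPrefix(action, "error(") || !strings.HasSuffix(action, ")") {
		return nil, fmt.Errorf("unsupported action in %q", spec)
	}
	msg := strings.TrimSuffix(strings.TrimPrefix(action, "error("), ")")
	return &term{remaining: count, msg: msg}, nil
}

// failpoint guards a term so it can be evaluated from concurrent API calls.
type failpoint struct {
	mu sync.Mutex
	t  *term
}

// evaluate returns the injected error until the count is used up.
func (fp *failpoint) evaluate() error {
	fp.mu.Lock()
	defer fp.mu.Unlock()
	if fp.t.remaining == 0 {
		return nil
	}
	fp.t.remaining--
	return errors.New(fp.t.msg)
}

func main() {
	t, err := parseTerm("1*error(sorry)")
	if err != nil {
		panic(err)
	}
	fp := &failpoint{t: t}
	fmt.Println(fp.evaluate()) // first call returns the injected error
	fmt.Println(fp.evaluate()) // later calls pass through
}
```

The count-then-pass behavior mirrors the `ctr t kill` session in this PR, where the first Kill fails with `sorry` and the second succeeds.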

@k8s-ci-robot

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@fuweid fuweid changed the title test: introduce runc shimv2 with failpoint contorl test: introduce runc shimv2 with failpoint control Jun 16, 2022
@fuweid fuweid force-pushed the failpoint-in-runc-shimv2 branch 2 times, most recently from 775c1e2 to 302cee4 Compare June 16, 2022 16:20
Comment thread pkg/failpoint/fail.go
@fuweid fuweid force-pushed the failpoint-in-runc-shimv2 branch 2 times, most recently from f5c9275 to eaeefbe Compare June 18, 2022 09:28
@fuweid fuweid changed the title test: introduce runc shimv2 with failpoint control test: introduce failpoint control to runc-shimv2 and cni Jun 18, 2022
@fuweid fuweid force-pushed the failpoint-in-runc-shimv2 branch 11 times, most recently from d96ad3d to be135f5 Compare June 21, 2022 15:41
@fuweid fuweid marked this pull request as ready for review June 21, 2022 16:17
@fuweid fuweid force-pushed the failpoint-in-runc-shimv2 branch 2 times, most recently from 3d34bed to e950c1b Compare June 28, 2022 00:08

fuweid commented Jun 30, 2022

@AkihiroSuda @mikebrow @dmcgowan ping~

@dmcgowan dmcgowan added this to the 1.7 milestone Jul 14, 2022
Comment thread integration/failpoint/cmd/cni-bridge-fp/main.go Outdated
@fuweid fuweid force-pushed the failpoint-in-runc-shimv2 branch 2 times, most recently from f916e51 to e637919 Compare July 21, 2022 16:20

fuweid commented Jul 22, 2022

/retest

fuweid added 8 commits July 22, 2022 23:25
Failpoint is used to control failures during API calls when testing, especially
when the API is complicated, like CRI RunPodSandbox. It helps us test
unexpected behavior without mocks. The control design is based on FreeBSD
fail(9), but simpler.

REF: https://www.freebsd.org/cgi/man.cgi?query=fail&sektion=9&apropos=0&manpath=FreeBSD%2B10.0-RELEASE

Signed-off-by: Wei Fu <[email protected]>

Currently, the runc shimv2 command-line manager doesn't support customized
options for the ttrpc server, for example, a ttrpc server interceptor.
This commit allows the task plugin to return the
`UnaryServerInterceptor` option to the manager so that the task plugin
can do extra work before handling an incoming request, like API-level
failpoint control.

Signed-off-by: Wei Fu <[email protected]>
Added a new runc shim binary for integration testing.

The shim is named io.containerd.runc-fp.v1, which allows us to use the
additional OCI annotation `io.containerd.runtime.v2.shim.failpoint.*` to
set up failpoints for the shim task API. Since a shim can be shared by
multiple containers, as in a Kubernetes pod, the failpoints are
initialized while setting up the shim server. So failpoint annotations
on containers that join the shim later will not take effect.

This commit also updates the ctr tool so that we can use `--annotation` to
specify annotations when running a container. For example:

```bash
➜  ctr run -d --runtime runc-fp.v1 \
     --annotation "io.containerd.runtime.v2.shim.failpoint.Kill=1*error(sorry)" \
     docker.io/library/alpine:latest testing sleep 1d

➜  ctr t ls
TASK       PID       STATUS
testing    147304    RUNNING

➜  ctr t kill -s SIGKILL testing
ctr: sorry: unknown

➜  ctr t kill -s SIGKILL testing

➜  sudo ctr t ls
TASK       PID       STATUS
testing    147304    STOPPED
```

The runc-fp.v1 shim is based on the core runc.v2 shim. We can use it to
inject failpoints when testing complicated or large transactional APIs,
like the CRI RunPodSandbox.

Signed-off-by: Wei Fu <[email protected]>
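The annotation scheme described in the commit above can be sketched in Go as a filter over the OCI annotations that maps each task API method name to its failpoint spec. The helper name `failpointsFromAnnotations` is hypothetical; only the annotation prefix comes from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// failpointPrefix is the annotation prefix named in the commit message.
const failpointPrefix = "io.containerd.runtime.v2.shim.failpoint."

// failpointsFromAnnotations (hypothetical helper) returns a map from task
// API method name to failpoint spec, ignoring unrelated annotations.
func failpointsFromAnnotations(annotations map[string]string) map[string]string {
	out := make(map[string]string)
	for key, spec := range annotations {
		if !strings.HasPrefix(key, failpointPrefix) {
			continue
		}
		method := strings.TrimPrefix(key, failpointPrefix)
		if method == "" {
			continue // prefix without a method name
		}
		out[method] = spec
	}
	return out
}

func main() {
	annotations := map[string]string{
		"io.containerd.runtime.v2.shim.failpoint.Kill": "1*error(sorry)",
		"io.kubernetes.cri.container-type":             "sandbox",
	}
	fps := failpointsFromAnnotations(annotations)
	fmt.Println(len(fps), fps["Kill"])
}
```

In a real shim this would run once, while the shim server is being set up, which is why annotations on containers added later have no effect.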
If there is any unskippable error while setting up the shim plugins, we
should fail and return the error to prevent a leaked shim instance. For
example, if there is an error while initializing the task plugin, the shim
ttrpc server will not contain any shim API methods. Any call to the shim
will then receive:

  failed to create shim task: service containerd.task.v2.Task: not implemented

Then containerd can't use `Shutdown` to make the shim exit, and the shim
is leaked. Also fail and return an error if there is no ttrpc service.

Signed-off-by: Wei Fu <[email protected]>
Introduce cni-bridge-fp as a wrapper binary around the CNI bridge plugin
for CRI testing.

With the CNI `io.kubernetes.cri.pod-annotations` capability enabled, the
user can inject failpoint settings through the pod annotation
`cniFailpointControlStateDir`, which names the directory that stores each
pod's failpoint settings in a file named
`${K8S_POD_NAMESPACE}-${K8S_POD_NAME}.json`.

When the plugin is invoked, it checks CNI_ARGS to load the failpoint for
the current CNI_COMMAND from disk. For testing, the user can prepare the
settings before calling RunPodSandbox.

Signed-off-by: Wei Fu <[email protected]>
@fuweid fuweid force-pushed the failpoint-in-runc-shimv2 branch from e637919 to 7f51060 Compare July 22, 2022 15:26

@mikebrow mikebrow left a comment


LGTM this will be very useful!

@@ -0,0 +1,32 @@
//go:build linux
// +build linux

You do not need this if you change the file name to main_linux.go


Updated.

* Use delegated plugin call to simplify cni-bridge-cni
* Add README.md for cni-bridge-cni

Signed-off-by: Wei Fu <[email protected]>
@fuweid fuweid force-pushed the failpoint-in-runc-shimv2 branch from 7f51060 to e6a2c07 Compare July 24, 2022 03:46

fuweid commented Jul 25, 2022

Ping @AkihiroSuda @dmcgowan


Labels

cherry-picked/1.6.x (PR commits are cherry-picked into release/1.6 branch), kind/test


6 participants