test: introduce failpoint control to runc-shimv2 and cni#7069
Merged
dmcgowan merged 9 commits into containerd:main on Jul 27, 2022
AkihiroSuda
reviewed
Jun 17, 2022
Member
Author
|
@AkihiroSuda @mikebrow @dmcgowan ping~ |
dmcgowan
reviewed
Jul 19, 2022
dmcgowan
approved these changes
Jul 19, 2022
Member
Author
|
/retest |
Failpoints let a test force a failure at a chosen point in an API call, which is especially useful when the API is complicated, like CRI RunPodSandbox. They make it possible to test unexpected behavior without mocks.
The control design is based on FreeBSD fail(9), but simpler.
REF: https://www.freebsd.org/cgi/man.cgi?query=fail&sektion=9&apropos=0&manpath=FreeBSD%2B10.0-RELEASE
Signed-off-by: Wei Fu <[email protected]>
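As a rough illustration of a text-described failpoint, the sketch below parses a single term of the form `<count>*error(<msg>)`, the shape used by `1*error(sorry)` in the ctr example in this PR, and fails that many evaluations before passing. The `failpoint` type and the `parseFailpoint` and `evaluate` names are hypothetical and are not the package added by this PR; other actions and chained terms are deliberately left out.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
	"sync"
)

// failpoint is a hypothetical, simplified model of one failpoint term.
type failpoint struct {
	mu    sync.Mutex
	count int    // remaining evaluations that should fail
	msg   string // error text returned while count > 0
}

// parseFailpoint accepts only "<count>*error(<msg>)" (an assumption made
// for this sketch; the real syntax supports more than this).
func parseFailpoint(term string) (*failpoint, error) {
	countStr, action, ok := strings.Cut(term, "*")
	if !ok {
		return nil, fmt.Errorf("missing count in %q", term)
	}
	n, err := strconv.Atoi(countStr)
	if err != nil || n < 0 {
		return nil, fmt.Errorf("invalid count in %q", term)
	}
	if !strings.HasPrefix(action, "error(") || !strings.HasSuffix(action, ")") {
		return nil, fmt.Errorf("unsupported action in %q", term)
	}
	msg := strings.TrimSuffix(strings.TrimPrefix(action, "error("), ")")
	return &failpoint{count: n, msg: msg}, nil
}

// evaluate returns the configured error until the count is used up,
// then returns nil on every later call.
func (fp *failpoint) evaluate() error {
	fp.mu.Lock()
	defer fp.mu.Unlock()
	if fp.count == 0 {
		return nil
	}
	fp.count--
	return errors.New(fp.msg)
}

func main() {
	fp, err := parseFailpoint("1*error(sorry)")
	if err != nil {
		panic(err)
	}
	fmt.Println("first call: ", fp.evaluate())
	fmt.Println("second call:", fp.evaluate())
}
```

With a count of 1, only the first evaluation fails, which matches the ctr example where the first `kill` returns an error and the second succeeds.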
Currently, the runc shimv2 command-line manager doesn't support customized ttrpc server options, for example, a ttrpc server interceptor. This commit allows the task plugin to return the `UnaryServerInterceptor` option to the manager, so that the task plugin can do extra work before handling an incoming request, such as API-level failpoint control.
Signed-off-by: Wei Fu <[email protected]>
Add a new runc shim binary for integration testing.
The shim is named io.containerd.runc-fp.v1. It accepts the additional OCI
annotation `io.containerd.runtime.v2.shim.failpoint.*` to set up failpoints
for the shim task API. Since a shim can be shared by multiple containers,
as in a Kubernetes pod, the failpoints are initialized when the shim server
is set up. As a result, failpoint annotations on containers that join the
shim later have no effect.
This commit also updates the ctr tool so that `--annotation` can be used to
specify annotations when running a container. For example:
```bash
➜ ctr run -d --runtime runc-fp.v1 \
    --annotation "io.containerd.runtime.v2.shim.failpoint.Kill=1*error(sorry)" \
    docker.io/library/alpine:latest testing sleep 1d
➜ ctr t ls
TASK PID STATUS
testing 147304 RUNNING
➜ ctr t kill -s SIGKILL testing
ctr: sorry: unknown
➜ ctr t kill -s SIGKILL testing
➜ sudo ctr t ls
TASK PID STATUS
testing 147304 STOPPED
```
The runc-fp.v1 shim is based on the core runc.v2 shim. We can use it to
inject failpoints when testing a complicated or large transactional API,
such as RunPodSandbox for a Kubernetes pod.
Signed-off-by: Wei Fu <[email protected]>
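A minimal sketch of how the `io.containerd.runtime.v2.shim.failpoint.*` annotations could be turned into a per-method table is shown below. The helper name `failpointsFromAnnotations` is hypothetical, and the assumption that the suffix after the prefix is the task API method name is inferred from the `...failpoint.Kill` example above.

```go
package main

import (
	"fmt"
	"strings"
)

// failpointPrefix is the annotation prefix named in the commit message.
const failpointPrefix = "io.containerd.runtime.v2.shim.failpoint."

// failpointsFromAnnotations (hypothetical helper) maps a method name such
// as "Kill" to its failpoint terms and ignores unrelated annotations.
func failpointsFromAnnotations(annotations map[string]string) map[string]string {
	out := make(map[string]string)
	for key, terms := range annotations {
		if !strings.HasPrefix(key, failpointPrefix) {
			continue
		}
		method := strings.TrimPrefix(key, failpointPrefix)
		if method == "" {
			continue // nothing after the prefix, so no method to attach to
		}
		out[method] = terms
	}
	return out
}

func main() {
	annotations := map[string]string{
		"io.containerd.runtime.v2.shim.failpoint.Kill": "1*error(sorry)",
		"example.com/unrelated":                        "ignored",
	}
	fmt.Println(failpointsFromAnnotations(annotations))
}
```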
If an unskippable error occurs while setting up the shim plugins, we should fail and return the error to prevent a leaked shim instance. For example, if the task plugin fails to initialize, the shim ttrpc server will not contain any shim API method, and any call to the shim will receive:
failed to create shim task: service containerd.task.v2.Task: not implemented
Then containerd can't use `Shutdown` to make the shim exit, and the shim leaks.
Also fail and return an error if there is no ttrpc service.
Signed-off-by: Wei Fu <[email protected]>
Signed-off-by: Wei Fu <[email protected]>
Introduce cni-bridge-fp, a wrapper binary around the CNI bridge plugin, for
CRI testing.
With the CNI `io.kubernetes.cri.pod-annotations` capability enabled, the user
can inject failpoint settings through the pod annotation
`cniFailpointControlStateDir`, which names the directory that stores each
pod's failpoint settings in a file named
`${K8S_POD_NAMESPACE}-${K8S_POD_NAME}.json`.
When the plugin is invoked, it reads CNI_ARGS to locate the settings and
loads the failpoint for the current CNI_COMMAND from disk. For testing, the
user can prepare the settings before calling RunPodSandbox.
Signed-off-by: Wei Fu <[email protected]>
Signed-off-by: Wei Fu <[email protected]>
Signed-off-by: Wei Fu <[email protected]>
mikebrow
approved these changes
Jul 22, 2022
Member
mikebrow
left a comment
LGTM this will be very useful!
AkihiroSuda
reviewed
Jul 23, 2022
@@ -0,0 +1,32 @@
//go:build linux
// +build linux
Member
You do not need this if you change the file name to main_linux.go
* Use delegated plugin call to simplify cni-bridge-cni
* Add README.md for cni-bridge-cni
Signed-off-by: Wei Fu <[email protected]>
Member
Author
|
Ping @AkihiroSuda @dmcgowan ~ |
dmcgowan
approved these changes
Jul 27, 2022
For the CRI plugin, it is hard to test the rollback corner cases. For
instance, `shim.Start` can time out when the node is under high load
pressure. High load pressure also affects the `shim.Delete`/`Kill` API, but
it is hard to emulate.
And even though it is easy to introduce a change like #5904 or #7044,
developers currently verify such changes manually. That is painful and
easily causes regressions.
Based on that, I would like to introduce failpoint control to the runc shimv2
and the CNI plugin for CI testing.
Both the runc shimv2 and the CNI plugin are external plugins in the
containerd ecosystem. We can add failpoints in a ttrpc interceptor and in a
CNI plugin wrapper instead of modifying the plugins' source code. The
failpoint control is based on the FreeBSD failpoint design, which allows us
to enable a failpoint with a text description. It is human-readable and easy
to maintain.
Signed-off-by: Wei Fu [email protected]