This has been causing regular CI failures for several weeks, for some reason more often on master than in PR-triggered builds.
It turns out to be a timing problem with publishing the exit event. When this specific test fails with bind: address already in use for the abstract socket, it is always in the 2nd run (the parallel run of make integration). This is because the first run leaves behind an event publisher that hangs trying to connect to the (now exited) containerd test daemon:
$ sudo go test -v -test.root -run DaemonRuntimeRoot
INFO[0000] running tests against containerd revision=c28ce39cea8e8dc8d7ff13c0fa0f7ca6217c2dab.m version=v1.2.0-beta.2-76-gc28ce39.m
=== RUN TestDaemonRuntimeRoot
--- FAIL: TestDaemonRuntimeRoot (1.29s)
daemon_config_linux_test.go:153: io.containerd.runc.v1: failed to listen to abstract unix socket "/containerd-shim/testing/TestDaemonRuntimeRoot/shim.sock": listen
unix /containerd-shim/testing/TestDaemonRuntimeRoot/shim.sock: bind: address already in use
: exit status 1: unknown
FAIL
$ ps aux | grep containerd
root 12951 3.9 1.8 190244 37208 pts/2 Sl 10:53 0:00 /usr/local/bin/containerd --address /tmp/containerd-test-new-daemon-with-config768890017/containerd.sock publish --topic /tasks/exit --namespace testing
This publisher will continue to try to connect to the (now exited) containerd UNIX socket:
[pid 9028] connect(4, {sa_family=AF_LOCAL, sun_path="/tmp/containerd-test-new-daemon-with-config577476199/containerd.sock"}, 71) = -1 ENOENT (No such file or directory)
until it reaches some system-level UNIX socket connection timeout and finally exits. By that point, Travis CI may already be running the same test again in the second "parallel" run of make integration (and recently this has been happening in an increasing share of runs), hence the failure.
From @crosbymichael:
that should be a general fix, we can add a context timeout for that