Daemon to take care of ingress cleanup on cluster leave and graceful shutdown#32283
vdemeester merged 2 commits into moby:master from aboch:clearingress
Conversation
Thanks for figuring out and fixing the panic, @aboch!
vendor.conf
Should we wait for a specific PR to go into libnetwork upstream before merging this PR?
Yes, currently this PR is vendoring from a private libnetwork.
daemon/network.go
How about having SetupIngress return a channel instead of taking one as an argument? It looks like SetupIngress is called rarely, and the overhead of allocating an extra channel when it's not needed would be negligible. The function signature could be:
func (daemon *Daemon) SetupIngress(create clustertypes.NetworkCreateRequest, nodeIP string) (<-chan struct{}, error) {
Thanks, that would be cleaner.
I agree the overhead is negligible.
daemon/network.go
Similarly to my comment on SetupIngress, ReleaseIngress could return a channel that gets closed when the job is completed.
func (daemon *Daemon) ReleaseIngress() (<-chan struct{}, error) {
daemon/daemon.go
This function calls clearAttachableNetworks - is that safe to call as part of the shutdown process? It seems to delete some networks.
Yes, as long as it is being called before netController.Stop().
Currently, on daemon graceful shutdown, we were skipping all the cluster resources cleanup.
I feel having this gracefully clean up what was done during leave is the proper thing to do, although I am not sure yet whether it is better to invoke this before or after the containers shutdown.
@tonistiigi's input here would help.
Regarding the networks being deleted, that is the right thing to do, as the swarm networks are dynamic networks. Otherwise the respective plumbing would not be deleted on graceful shutdown of a daemon participating in a cluster. Nor will it be taken care of on the next daemon boot, as libnetwork is no longer aware those networks existed (they are dynamic). So this is actually adding logic that is missing today.
Just to confirm, would these networks be recreated when the daemon next starts up?
No. The dynamic networks are owned by swarm and created on demand when the first task (or user-run container) is connected to them, and they are removed from the node once the last task is disconnected. This is the current behavior.
So yes, on the next daemon startup they will be recreated if the node is meant to join the cluster and the scheduler dispatches a task on this node on that network, or if a user-run container on that network has a restart policy configured.
daemon/daemon.go
Let's either check the error here, or change ReleaseIngress so it does not return an error (currently it always returns nil).
OK, I will check the error, just in case the function's scope increases in the future.
Ok, this is interesting:
I moved the ingress cleanup after resetting the cluster provider, so that it runs after the network agent stops listening on cluster events. Experimental now passed: https://jenkins.dockerproject.org/job/Docker-PRs-experimental/32391/console I will spin a few more runs to make sure the issue is gone.
@aboch: How's it looking?
New failure. At first glance it does not seem related to this change. Janky test logs: Daemon logs:
The Windows RS1 failure (https://jenkins.dockerproject.org/job/Docker-PRs-WoW-RS1/12239/console) does not seem related, as I saw it in other runs from other PRs: https://jenkins.dockerproject.org/job/Docker-PRs-WoW-RS1/12252/console.
@aboch: Is this ready to merge?
Yes, the Windows RS1 failure is not related.
daemon/daemon.go
I noticed that setClusterProvider calls daemon.netController.SetClusterProvider, which writes to the c.cfg.Daemon.DisableProvider channel. Since this is asynchronous, do we have to wait for the message sent on that channel to be received? It seems like it's important that the cluster provider be disabled before moving on to ReleaseIngress, because when these were in the opposite order the panic still happened.
Yes, ideally we should wait for the network cluster agent to shut down. In practice it looks like the ingress removal job queuing and processing take longer than that.
The proper solution would be to add an AgentStopWait() function to Controller.
// DaemonLeavesCluster informs the daemon has left the cluster
func (daemon *Daemon) DaemonLeavesCluster() {
I feel like this function name/comment needs modification. A daemon shutting down is different from a daemon leaving a cluster. If we are not careful, upgrading/restarting the daemon may result in errors, e.g., the node not being able to rejoin the cluster.
Please refer to #30157 (comment), where the function naming was decided.
This function takes care of the things that need to be done when the node leaves the cluster.
At the same time, the same set of actions needs to be taken care of when a node is shutting down.
I did not want to duplicate the same code for the shutdown path.
Signed-off-by: Alessandro Boch <[email protected]>
Regarding the powerpc failure: the next test failure message is not really a panic, it's an induced trace dump on a failure to stop the node in cluster.Cleanup. Looks like the integ-ci interprets it as a panic? I will trigger a new run now.
@aaronlehmann Taken care of your comment. Thanks. For reference, I have been logging all the extraneous CI failures that could have been incorrectly linked to the one this PR is fixing.
@aboch: Did that failure show stack traces for any other goroutines? It would be useful to see where it was stuck.
@aaronlehmann I will attach the full logs in a snippet somewhere.
I think the agent is stuck, and I remember fixing a very similar bug recently, so I wonder if there's another case I didn't handle, or whether something about this change causes the fix to regress. cc @dongluochen
The full daemon logs are now linked in my previous post.
Indeed, it looks like there is a deadlock involving libnetwork: a few other goroutines are blocked on what looks like the same lock. I'm not sure which goroutine is holding it.
Oh, this one is holding it: The deadlock is that cc @tonistiigi
I don't think the deadlock discussed in the last few comments is related to the changes in this PR, so... LGTM. But let's make sure to open an issue for the deadlock and try to fix it ASAP.
Yes, I was about to post the same findings you posted above.
@aaronlehmann It looks like the reason why the lock is being held that long is only because the waiting for
...but it times out because of the deadlock. That's the problem.
aaronlehmann
left a comment
Put my LGTM in a comment but forgot to do this...
we can have the daemon naturally take care of the cleanup of the resources it created.
The cleanup will be done on cluster leave and on graceful shutdown.
and libnetwork were racing in disconnecting the ingress network endpoints on a daemon shutdown.
Fixes #32173
- A picture of a cute animal (not mandatory but encouraged)
