
Daemon to take care of ingress cleanup on cluster leave and graceful shutdown#32283

Merged
vdemeester merged 2 commits into moby:master from aboch:clearingress
Apr 7, 2017

Conversation

@aboch
Contributor

@aboch aboch commented Mar 31, 2017

  • Now that the daemon has the infrastructure to release the ingress resources,
    we can have the daemon naturally take care of the cleanup of the resources it created.
    The cleanup will be done on cluster leave and on graceful shutdown.
  • This also avoids a panic revealed by the integration tests, where both the daemon
    and libnetwork were racing to disconnect the ingress network endpoints on a daemon
    shutdown.
  • Also allow renaming containers on swarm networks.

Fixes #32173

- A picture of a cute animal (not mandatory but encouraged)
gufo

@aaronlehmann

Thanks for figuring out and fixing the panic, @aboch!

vendor.conf Outdated


Should we wait for a specific PR to go into libnetwork upstream before merging this PR?

Contributor Author


Yes, currently this PR is vendoring from a private libnetwork.


@aaronlehmann aaronlehmann Mar 31, 2017


How about having SetupIngress return a channel instead of taking one as an argument? It looks like SetupIngress is called rarely, and the overhead of allocating an extra channel when it's not needed would be negligible. The function signature could be:

func (daemon *Daemon) SetupIngress(create clustertypes.NetworkCreateRequest, nodeIP string) (<-chan struct{}, error) {

Contributor Author


Thanks, that would be cleaner.
I agree the overhead is negligible.


Similarly to my comment on SetupIngress, ReleaseIngress could return a channel that gets closed when the job is completed.

func (daemon *Daemon) ReleaseIngress() (<-chan struct{}, error) {

daemon/daemon.go Outdated


This function calls clearAttachableNetworks - is that safe to call as part of the shutdown process? It seems to delete some networks.

Contributor Author

@aboch aboch Mar 31, 2017


Yes, as long as it is being called before netController.Stop().
Currently, on graceful daemon shutdown, we were skipping all the cluster resource cleanup.
I feel having this gracefully clean up what was done during leave is the proper thing to do.
Although I am not sure yet whether it is better to invoke this before or after the containers shut down.
@tonistiigi input here would help.

Contributor Author

@aboch aboch Mar 31, 2017


Regarding the networks being deleted, that is the right thing to do, as the swarm networks are dynamic networks. Otherwise the respective plumbing would not be deleted on graceful shutdown of a daemon participating in a cluster. Nor will it be taken care of on the next daemon boot, as libnetwork is no longer aware those networks existed (being dynamic). So this is actually adding logic that is missing today.


Just to confirm, would these networks be recreated when the daemon next starts up?

Contributor Author


No. The dynamic networks are owned by swarm and created on demand when the first task (or user-run container) is connected to them. And they are removed from the node once the last task is disconnected. This is the current behavior.

Contributor Author


So yes, on next daemon startup they will be recreated if the node is meant to join the cluster and the scheduler dispatches a task on this node on that network, or if a user-run container on that network has a restart policy configured.

daemon/daemon.go Outdated


Let's either check the error here, or change ReleaseIngress so it does not return an error (currently it always returns nil).

Contributor Author

@aboch aboch Mar 31, 2017


Ok, I will check the error, just in case the function's scope increases in the future.

@aboch
Contributor Author

aboch commented Apr 1, 2017

Ok, this is interesting:

23:24:31 
23:24:31 ----------------------------------------------------------------------
23:24:31 FAIL: check_test.go:353: DockerSwarmSuite.TearDownTest
23:24:31 
23:24:31 check_test.go:358:
23:24:31     d.Stop(c)
23:24:31 daemon/daemon.go:392:
23:24:31     t.Fatalf("Error while stopping the daemon %s : %v", d.id, err)
23:24:31 ... Error: Error while stopping the daemon d16e9027b5052 : exit status 2
23:24:31 
23:24:31 
23:24:31 ----------------------------------------------------------------------
23:24:31 PANIC: docker_cli_swarm_test.go:1680: DockerSwarmSuite.TestSwarmPublishDuplicatePorts
23:24:31 
23:24:31 [d16e9027b5052] waiting for daemon to start
23:24:31 [d16e9027b5052] daemon started
23:24:31 
23:24:31 [d16e9027b5052] exiting daemon
23:24:31 ... Panic: Fixture has panicked (see related PANIC)
23:24:31 
23:24:31 ----------------------------------------------------------------------

@aboch
Contributor Author

aboch commented Apr 1, 2017

I moved the ingress cleanup to after resetting the cluster provider, so it happens after the network agent stops listening on cluster events.

Experimental now passed: https://jenkins.dockerproject.org/job/Docker-PRs-experimental/32391/console

I will spin a few more runs to make sure the issue is gone.

@aaronlehmann

@aboch: How's it looking?

@aboch
Contributor Author

aboch commented Apr 4, 2017

New failure. At first look it does not seem related to this change.
I found this existing issue #30469, but it is supposed to be resolved.
Maybe there is some other code path which can create the same failure.

https://jenkins.dockerproject.org/job/Docker-PRs/41130/console

Janky test logs:

PASS: docker_cli_exec_test.go:503: DockerSuite.TestExecStartFails       0.486s

----------------------------------------------------------------------
FAIL: check_test.go:95: DockerSuite.TearDownTest

check_test.go:96:
    testEnv.Clean(c, dockerBinary)
environment/clean.go:63:
    t.Fatalf("error removing containers %v : %v (%s)", containers, result.Error, result.Combined())
... Error: error removing containers [09b20cae072c] : exit status 1 (Error response from daemon: Unable to remove filesystem for 09b20cae072c496bdf92828b90c886276763fb5f52197b41863cfd4ba3373e7c: remove /var/lib/docker/containers/09b20cae072c496bdf92828b90c886276763fb5f52197b41863cfd4ba3373e7c/shm: device or resource busy
)


----------------------------------------------------------------------
PANIC: docker_cli_exec_test.go:226: DockerSuite.TestExecStopNotHanging

docker_cli_exec_test.go:248:
    c.Fatal("Container stop timed out")
... Error: Container stop timed out

... Panic: Fixture has panicked (see related PANIC)

----------------------------------------------------------------------
MISS: docker_cli_exec_unix_test.go:43: DockerSuite.TestExecTTY

Daemon logs:

time="2017-04-04T19:22:13.123010445Z" level=debug msg="Calling POST /v1.29/containers/testing/stop" 
time="2017-04-04T19:22:13.123059553Z" level=debug msg="Sending kill signal 15 to container 09b20cae072c496bdf92828b90c886276763fb5f52197b41863cfd4ba3373e7c" 
time="2017-04-04T19:22:13.167246120Z" level=debug msg="libcontainerd: received containerd event: &types.Event{Type:\"start-process\", Id:\"09b20cae072c496bdf92828b90c886276763fb5f52197b41863cfd4ba3373e7c\", Status:0x0, Pid:\"96f8bc260881041bd76766cdb5d25f3584dde7ecf8b37673f2aaf49902
3f7d70\", Timestamp:(*timestamp.Timestamp)(0xc42174fa30)}" 
time="2017-04-04T19:22:13.167357788Z" level=debug msg="libcontainerd: event unhandled: type:\"start-process\" id:\"09b20cae072c496bdf92828b90c886276763fb5f52197b41863cfd4ba3373e7c\" pid:\"96f8bc260881041bd76766cdb5d25f3584dde7ecf8b37673f2aaf499023f7d70\" timestamp:<seconds:149133373
3 nanos:166613185 > " 
time="2017-04-04T19:22:13.208354556Z" level=debug msg="containerd: process exited" id=09b20cae072c496bdf92828b90c886276763fb5f52197b41863cfd4ba3373e7c pid=init status=143 systemPid=13245 
time="2017-04-04T19:22:13.213525414Z" level=error msg="containerd: get exit status" error="containerd: process has not exited" id=09b20cae072c496bdf92828b90c886276763fb5f52197b41863cfd4ba3373e7c pid=96f8bc260881041bd76766cdb5d25f3584dde7ecf8b37673f2aaf499023f7d70 systemPid=13296 
time="2017-04-04T19:22:13.213554014Z" level=debug msg="containerd: process exited" id=09b20cae072c496bdf92828b90c886276763fb5f52197b41863cfd4ba3373e7c pid=96f8bc260881041bd76766cdb5d25f3584dde7ecf8b37673f2aaf499023f7d70 status=255 systemPid=13296 
panic: close of nil channel

goroutine 10029 [running]:
panic(0x8aae20, 0xc420363e40)
        /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/docker/containerd/supervisor.(*Supervisor).execExit.func1(0xc4203666e0, 0xc42000d5f0, 0x0)
        /tmp/tmp.qEcmHwPKNl/src/github.com/docker/containerd/supervisor/exit.go:90 +0x10c
created by github.com/docker/containerd/supervisor.(*Supervisor).execExit
        /tmp/tmp.qEcmHwPKNl/src/github.com/docker/containerd/supervisor/exit.go:91 +0xee
time="2017-04-04T19:22:13.217072157Z" level=error msg="libcontainerd: failed to receive event from containerd: rpc error: code = 13 desc = transport is closing" 
time="2017-04-04T19:22:13.405883527Z" level=debug msg="libcontainerd: containerd health check returned error: rpc error: code = 14 desc = grpc: the connection is unavailable" 
time="2017-04-04T19:22:13.905860467Z" level=debug msg="libcontainerd: containerd health check returned error: rpc error: code = 14 desc = grpc: the connection is unavailable" 
time="2017-04-04T19:22:13.908546631Z" level=info msg="libcontainerd: new containerd process, pid: 13315" 
time="2017-04-04T19:22:13.927451789Z" level=debug msg="containerd: grpc api on /var/run/docker/libcontainerd/docker-containerd.sock" 
time="2017-04-04T19:22:13.930737569Z" level=debug msg="containerd: read past events" count=636 
time="2017-04-04T19:22:13.93092688Z" level=debug msg="containerd: supervisor running" cpus=2 memory=7984 runtime=docker-runc runtimeArgs=[] stateDir="/var/run/docker/libcontainerd/containerd" 

@aaronlehmann aaronlehmann added the priority/P1 Important: P1 issues are a top priority and a must-have for the next release. label Apr 4, 2017
@aaronlehmann aaronlehmann added this to the 17.05.0 milestone Apr 4, 2017
@aboch
Contributor Author

aboch commented Apr 5, 2017

The Windows RS1 failure (https://jenkins.dockerproject.org/job/Docker-PRs-WoW-RS1/12239/console) does not seem related, as I saw it in other runs from other PRs: https://jenkins.dockerproject.org/job/Docker-PRs-WoW-RS1/12252/console.

18:17:01 PASS: docker_api_stats_test.go:23: DockerSuite.TestAPIStatsNoStreamGetCpu	13.407s
18:17:15 
18:17:15 ----------------------------------------------------------------------
18:17:15 FAIL: check_test.go:95: DockerSuite.TearDownTest
18:17:15 
18:17:15 check_test.go:96:
18:17:15     testEnv.Clean(c, dockerBinary)
18:17:15 environment/clean.go:63:
18:17:15     t.Fatalf("error removing containers %v : %v (%s)", containers, result.Error, result.Combined())
18:17:15 ... Error: error removing containers [83c3e8220653] : exit status 1 (Error response from daemon: Could not kill running container 83c3e82206534c7e03a8e4d98b976267bdca7c57e6a93754976c01426831fb12, cannot remove - Cannot kill container 83c3e82206534c7e03a8e4d98b976267bdca7c57e6a93754976c01426831fb12: invalid container: 83c3e82206534c7e03a8e4d98b976267bdca7c57e6a93754976c01426831fb12
18:17:15 )
18:17:15 
18:17:15 
18:17:15 ----------------------------------------------------------------------
18:17:15 PANIC: docker_api_stats_test.go:63: DockerSuite.TestAPIStatsStoppedContainerInGoroutines
18:17:15 
18:17:15 ... Panic: Fixture has panicked (see related PANIC)
18:17:15 
18:17:15 ----------------------------------------------------------------------

@aaronlehmann

@aboch: Is this ready to merge?

@aboch
Contributor Author

aboch commented Apr 5, 2017

Yes, the Windows RS1 failure is not related.

daemon/daemon.go Outdated


I noticed that setClusterProvider calls daemon.netController.SetClusterProvider, which writes to the c.cfg.Daemon.DisableProvider channel. Since this is asynchronous, do we have to wait for the message sent on that channel to be received? It seems like it's important that the cluster provider be disabled before moving on to ReleaseIngress, because when these were in the opposite order the panic still happened.

Contributor Author


Yes, ideally we should wait for the network cluster agent to shut down. In practice it looks like the ingress removal job queuing and processing take longer than that.
The proper solution would be to add an AgentStopWait() function to Controller.

Contributor

@dongluochen dongluochen Apr 6, 2017


// DaemonLeavesCluster informs the daemon has left the cluster
func (daemon *Daemon) DaemonLeavesCluster() {

I feel like this function name/comment needs modification. A daemon shutdown is different from a daemon leaving a cluster. If we are not careful, upgrading or restarting the daemon may result in errors, e.g., the node not being able to rejoin the cluster.

Contributor Author


Please refer to #30157 (comment) where the function naming was decided.

This function takes care of the things that need to be done when the node leaves the cluster.
At the same time, the same set of actions needs to be taken care of when a node is shutting down.

I did not want to duplicate the same code for the shutdown path.


@aaronlehmann aaronlehmann left a comment


@aboch
Contributor Author

aboch commented Apr 6, 2017

Regarding the PowerPC failure, it does not look related:

00:23:29 FAIL: check_test.go:353: DockerSwarmSuite.TearDownTest
00:23:29 
00:23:29 unmount of /tmp/docker-execroot/daaa8a8791d13/netns failed: invalid argument
00:23:29 unmount of /tmp/docker-execroot/daaa8a8791d13/netns failed: no such file or directory
00:23:29 check_test.go:358:
00:23:29     d.Stop(c)
00:23:29 daemon/daemon.go:392:
00:23:29     t.Fatalf("Error while stopping the daemon %s : %v", d.id, err)
00:23:29 ... Error: Error while stopping the daemon dfe6bdb86d023 : exit status 130

Then the next test failure message

00:23:29 ----------------------------------------------------------------------
00:23:29 PANIC: docker_api_swarm_test.go:228: DockerSwarmSuite.TestAPISwarmLeaderProxy
00:23:29 
00:23:29 [daaa8a8791d13] waiting for daemon to start
00:23:29 [daaa8a8791d13] daemon started
00:23:29 
00:23:29 [dfe6bdb86d023] waiting for daemon to start
00:23:29 [dfe6bdb86d023] daemon started
00:23:29 
00:23:29 [d5680384ce7a2] waiting for daemon to start
00:23:29 [d5680384ce7a2] daemon started
00:23:29 
00:23:29 [daaa8a8791d13] exiting daemon
00:23:29 [dfe6bdb86d023] daemon started
00:23:29 Attempt #2: daemon is still running with pid 16879
00:23:29 Attempt #3: daemon is still running with pid 16879
00:23:29 Attempt #4: daemon is still running with pid 16879
00:23:29 [dfe6bdb86d023] exiting daemon
00:23:29 ... Panic: Fixture has panicked (see related PANIC)
00:23:29 
00:23:29 ----------------------------------------------------------------------

is not really a panic; it's an induced trace dump on failure to stop the node in cluster.Cleanup. It looks like the integration CI interprets it as a panic?

time="2017-04-06T00:23:25.759764000Z" level=error msg="agent failed to clean up assignments" error="context deadline exceeded" 
time="2017-04-06T00:23:25.759849000Z" level=error msg="failed to shut down cluster node: context deadline exceeded" 
time="2017-04-06T00:23:25.764740000Z" level=info msg="Stopping manager" module=node node.id=uorso6lylarb492481folzh0m 
time="2017-04-06T00:23:25.764940000Z" level=info msg="shutting down certificate renewal routine" module="node/tls" node.id=uorso6lylarb492481folzh0m node.role=swarm-manager 
time="2017-04-06T00:23:25.765313000Z" level=error msg="failed to remove node" error="rpc error: code = 10 desc = dispatcher is stopped" method="(*Dispatcher).Session" node.id=ridf3fy6kgvofsrwp6loau9ps node
.session=2je9nhczrjho60wvm8tqjll23 
time="2017-04-06T00:23:25.765621000Z" level=error msg="LogBroker exited with an error" error="context canceled" module=node node.id=uorso6lylarb492481folzh0m 
goroutine 1 [running]:
github.com/docker/docker/pkg/signal.DumpStacks(0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /go/src/github.com/docker/docker/pkg/signal/trap.go:82 +0xbc
github.com/docker/docker/daemon/cluster.(*Cluster).Cleanup(0xc420071860)
        /go/src/github.com/docker/docker/daemon/cluster/cluster.go:350 +0x4a8
main.(*DaemonCli).start(0xc42023f2c0, 0xc42023f200, 0x1166b8ab, 0x17, 0xc4203d0400, 0xc42012efa0, 0xc420476870, 0x0, 0x0)
        /go/src/github.com/docker/docker/cmd/dockerd/daemon.go:335 +0x1f7c
main.runDaemon(0x11402700, 0x1166b8ab, 0x17, 0xc4203d0400, 0xc42012efa0, 0xc420476870, 0x0, 0x0)
        /go/src/github.com/docker/docker/cmd/dockerd/docker.go:91 +0x1c0

I will trigger a new run now.

@aboch
Contributor Author

aboch commented Apr 6, 2017

@aaronlehmann Taken care of your comment. Thanks.
All green now.

For reference, I have been logging all the extraneous CI failures that could have been incorrectly linked to the one this PR is fixing.

@aaronlehmann

@aboch: Did that failure show stack traces for any other goroutines? It would be useful to see where it was stuck.

@aboch
Contributor Author

aboch commented Apr 6, 2017

@aaronlehmann
My guess is this is the routine that is stuck, causing cluster.Cleanup() to dump the stack on node stop failure, because of the timeout.

goroutine 184 [syscall]:
syscall.Syscall(0x94, 0x15, 0x0, 0x0, 0xc421248a00, 0x58e58a7d, 0xbaf5d)
        /usr/local/go/src/syscall/asm_linux_ppc64x.s:17 +0xc
syscall.Fdatasync(0x15, 0x0, 0x0)
        /usr/local/go/src/syscall/zsyscall_linux_ppc64le.go:472 +0x48
github.com/docker/docker/vendor/github.com/coreos/etcd/pkg/fileutil.Fdatasync(0xc42002c2d8, 0x0, 0x0)
        /go/src/github.com/docker/docker/vendor/github.com/coreos/etcd/pkg/fileutil/sync_linux.go:33 +0x44
github.com/docker/docker/vendor/github.com/coreos/etcd/wal.(*WAL).sync(0xc4210989c0, 0x0, 0x0)
        /go/src/github.com/docker/docker/vendor/github.com/coreos/etcd/wal/wal.go:444 +0x118
github.com/docker/docker/vendor/github.com/coreos/etcd/wal.(*WAL).Close(0xc4210989c0, 0x0, 0x0)
        /go/src/github.com/docker/docker/vendor/github.com/coreos/etcd/wal/wal.go:509 +0x1b8
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/state/raft/storage.(*EncryptedRaftLogger).Close(0xc4210cf3e0, 0x3fff9e4d62b8, 0xc4212981c0)
        /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/state/raft/storage/storage.go:351 +0xf4
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/state/raft.(*Node).stop(0xc420b2af00, 0x3fff9e4d62b8, 0xc4212981c0)
        /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/state/raft/raft.go:802 +0x1a0
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/state/raft.(*Node).Run.func1(0xc421254590, 0xc420b2af00, 0x3fff9e4d62b8, 0xc4212981c0)
        /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/state/raft/raft.go:509 +0x5c
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/state/raft.(*Node).Run(0xc420b2af00, 0x3fff9e4d62b8, 0xc4212981c0, 0x0, 0x0)
        /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/state/raft/raft.go:682 +0x1758
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager.(*Manager).Run.func5(0xc421024580, 0x3fff9e4d62b8, 0xc4207c1740)
        /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/manager.go:506 +0x44
created by github.com/docker/docker/vendor/github.com/docker/swarmkit/manager.(*Manager).Run
        /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/manager.go:511 +0x176c

I will attach the full logs in a snippet somewhere.
EDIT: Full log here

@aaronlehmann

I think the agent is stuck, and agent.Leave hangs until the context times out. Then (*Node).Stop triggers the manager shutdown, and returns immediately with a context deadline exceeded error. The goroutine stack you showed above looks like the manager shutting down normally. There might be another goroutine that shows the agent wedged.

I remember fixing a very similar bug recently, so I wonder if there's another case I didn't handle, or something about this change causes the fix to regress.

cc @dongluochen

@aboch
Contributor Author

aboch commented Apr 6, 2017

The full daemon logs are now linked in my previous post.
I did look for a stuck Leave but could not find it in the stack trace.

@aaronlehmann

Leave would have returned with a "context deadline exceeded error" by this time, otherwise the code that dumps the stacks would not have been reached. Leave waits for the agent's main loop to do something, so I was suspecting that loop is wedged.

Indeed, it looks like there is a deadlock involving libnetwork:

goroutine 146 [semacquire]:
sync.runtime_Semacquire(0xc42000aff0)
	/usr/local/go/src/runtime/sema.go:47 +0x38
sync.(*Mutex).Lock(0xc42000afec)
	/usr/local/go/src/sync/mutex.go:85 +0x2f8
github.com/docker/docker/vendor/github.com/docker/libnetwork.(*controller).getStores(0xc42000af00, 0x0, 0x0, 0x0)
	/go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/store.go:76 +0x58
github.com/docker/docker/vendor/github.com/docker/libnetwork.(*controller).getNetworksFromStore(0xc42000af00, 0x0, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/store.go:145 +0x60
github.com/docker/docker/vendor/github.com/docker/libnetwork.(*controller).Networks(0xc42000af00, 0x0, 0x0, 0x0)
	/go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/controller.go:868 +0x50
github.com/docker/docker/daemon.(*Daemon).GetNetworkDriverList(0xc420496200, 0x0, 0x0, 0x0)
	/go/src/github.com/docker/docker/daemon/network.go:459 +0x368
github.com/docker/docker/daemon.(*Daemon).showPluginsInfo(0xc420496200, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/docker/docker/daemon/info.go:176 +0x64
github.com/docker/docker/daemon.(*Daemon).SystemInfo(0xc420496200, 0xc42001d502, 0x0, 0x0)
	/go/src/github.com/docker/docker/daemon/info.go:101 +0x840
github.com/docker/docker/daemon/cluster/executor/container.(*executor).Describe(0xc420e72760, 0x3fff9e4a08e0, 0xc420a758c0, 0x9, 0x0, 0x0)
	/go/src/github.com/docker/docker/daemon/cluster/executor/container/executor.go:34 +0x60
github.com/docker/docker/vendor/github.com/docker/swarmkit/agent.(*Agent).nodeDescriptionWithHostname(0xc420eb6dc0, 0x3fff9e4a08e0, 0xc420a758c0, 0x1, 0x0, 0x0)
	/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/agent/agent.go:521 +0x6c
github.com/docker/docker/vendor/github.com/docker/swarmkit/agent.(*Agent).run(0xc420eb6dc0, 0x3fff9e4a08e0, 0xc420a758c0)
	/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/agent/agent.go:314 +0x15fc
created by github.com/docker/docker/vendor/github.com/docker/swarmkit/agent.(*Agent).Start.func1
	/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/agent/agent.go:79 +0x84

A few other goroutines are blocked on what looks like the same lock. I'm not sure which goroutine is holding it.

@aaronlehmann

aaronlehmann commented Apr 6, 2017

Oh, this one is holding it:

goroutine 467 [semacquire]:
sync.runtime_Semacquire(0xc42007186c)
	/usr/local/go/src/runtime/sema.go:47 +0x38
sync.(*RWMutex).RLock(0xc420071860)
	/usr/local/go/src/sync/rwmutex.go:43 +0x9c
github.com/docker/docker/daemon/cluster.(*Cluster).IsAgent(0xc420071860, 0x117ba000)
	/go/src/github.com/docker/docker/daemon/cluster/cluster.go:240 +0x48
github.com/docker/docker/vendor/github.com/docker/libnetwork.(*controller).isAgent(0xc42000af00, 0xc42000af00)
	/go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/controller.go:638 +0x110
github.com/docker/docker/vendor/github.com/docker/libnetwork.(*network).driver(0xc4210a9e00, 0x1, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/network.go:767 +0xe0
github.com/docker/docker/vendor/github.com/docker/libnetwork.(*network).addEndpoint(0xc4210a9e00, 0xc421705e40, 0x0, 0x0)
	/go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/network.go:875 +0x44
github.com/docker/docker/vendor/github.com/docker/libnetwork.(*network).CreateEndpoint(0xc4210a9e00, 0xc420fc11d1, 0x21, 0xc4217311a0, 0x4, 0x4, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/network.go:950 +0xabc
github.com/docker/docker/daemon.(*Daemon).connectToNetwork(0xc420496200, 0xc4207bd400, 0x1164b5f6, 0x6, 0xc421714480, 0x0, 0x0, 0x0)
	/go/src/github.com/docker/docker/daemon/container_operations.go:735 +0x668
github.com/docker/docker/daemon.(*Daemon).allocateNetwork(0xc420496200, 0xc4207bd400, 0x0, 0x0)
	/go/src/github.com/docker/docker/daemon/container_operations.go:526 +0x378
github.com/docker/docker/daemon.(*Daemon).initializeNetworking(0xc420496200, 0xc4207bd400, 0x0, 0x0)
	/go/src/github.com/docker/docker/daemon/container_operations.go:903 +0x3d0
github.com/docker/docker/daemon.(*Daemon).containerStart(0xc420496200, 0xc4207bd400, 0x0, 0x0, 0x0, 0x0, 0x1, 0x0, 0x0)
	/go/src/github.com/docker/docker/daemon/start.go:140 +0x1f4
github.com/docker/docker/daemon.(*Daemon).ContainerStart(0xc420496200, 0xc421764000, 0x21, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/docker/docker/daemon/start.go:86 +0x5d4
github.com/docker/docker/daemon/cluster/executor/container.(*containerAdapter).start(0xc421085830, 0x3fff9e4d62b8, 0xc420d77640, 0x0, 0x0)
	/go/src/github.com/docker/docker/daemon/cluster/executor/container/adapter.go:285 +0xd8
github.com/docker/docker/daemon/cluster/executor/container.(*controller).Start(0xc420f264b0, 0x3fff9e4d62b8, 0xc420d77640, 0x0, 0x0)
	/go/src/github.com/docker/docker/daemon/cluster/executor/container/controller.go:189 +0x164
github.com/docker/docker/vendor/github.com/docker/swarmkit/agent/exec.Do(0x3fff9e4d62b8, 0xc420d77640, 0xc4210a9b00, 0x11fdf000, 0xc420f264b0, 0x0, 0x0, 0x0)
	/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/agent/exec/controller.go:314 +0x720
github.com/docker/docker/vendor/github.com/docker/swarmkit/agent.(*taskManager).run.func2(0x3fff9e4a08e0, 0xc420f2b140, 0x0, 0x0)
	/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/agent/task.go:134 +0x290
github.com/docker/docker/vendor/github.com/docker/swarmkit/agent.runctx(0x3fff9e4a08e0, 0xc420f2b140, 0xc420eafb00, 0xc420f60300, 0xc420d776c0)
	/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/agent/helpers.go:9 +0x50
created by github.com/docker/docker/vendor/github.com/docker/swarmkit/agent.(*taskManager).run
	/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/agent/task.go:150 +0x51c

The deadlock is that (*Cluster).Cleanup is holding c.mu, but cleaning up involves calling agent.Leave(), which may be blocked by the agent calling into libnetwork code, which needs to acquire the libnetwork controller's mutex. But it can't acquire that mutex because the goroutine shown above is holding it, trying to call into Cluster.

cc @tonistiigi

@aaronlehmann

I don't think the deadlock discussed in the last few comments is related to the changes in this PR, so...

LGTM

But let's make sure to open an issue for the deadlock and try to fix it ASAP.

@aboch
Contributor Author

aboch commented Apr 6, 2017

Yes, I was about to post the same findings you posted above.

@aboch
Contributor Author

aboch commented Apr 6, 2017

@aaronlehmann
Actually I do not think there is a deadlock involving the (*Cluster).mu here: (*Cluster).Cleanup holds c.mu.Lock() until it releases it right after the dump is printed. So of course during the dump it is holding the lock, and IsAgent called from libnetwork is there waiting to acquire it as c.mu.RLock() in the trace dump.

It looks like the reason the lock is being held that long is only the wait for node.Stop(), called in (*Cluster).Cleanup, to time out:

time="2017-04-06T00:23:25.759849000Z" level=error msg="failed to shut down cluster node: context deadline exceeded"

@aaronlehmann

...but it times out because of the deadlock. That's the problem.


@aaronlehmann aaronlehmann left a comment


Put my LGTM in a comment but forgot to do this...

Member

@vdemeester vdemeester left a comment


LGTM 🦁

@aboch
Contributor Author

aboch commented Apr 13, 2017

This also fixes
#32111
#31913
#30314


Labels

priority/P1 Important: P1 issues are a top priority and a must-have for the next release. status/2-code-review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants