Description
For some time now (I believe since updating from 19.03, but I'm not sure) we have been experiencing strange Docker Swarm behavior. We are running a small Docker Swarm cluster: two workers, one manager. The nodes are VMs running Ubuntu 20.04, managed by Proxmox.
We can see that sometimes (we can't tell why or exactly when) the Docker daemon becomes unresponsive. We experience it as a problem with DNS requests: all containers on that node keep working, they just can't query Docker's internal DNS. When this happens, I can't drain that node from the manager - it is marked as Drain, but the containers do not stop. I can't stop containers with docker stop either. The only solution is to restart Docker with systemctl; after that, all is well for some time.
While restarting Docker I can see a very long goroutine dump from Go; the most frequently repeated part is:
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]: goroutine 21208128 [semacquire, 7 minutes]:
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]: sync.runtime_SemacquireMutex(0xc000f320fc, 0xc0027b9c00, 0x1)
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]: /usr/local/go/src/runtime/sema.go:71 +0x49
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]: sync.(*Mutex).lockSlow(0xc000f320f8)
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]: /usr/local/go/src/sync/mutex.go:138 +0xfe
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]: sync.(*Mutex).Lock(...)
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]: /usr/local/go/src/sync/mutex.go:81
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]: github.com/docker/docker/vendor/github.com/docker/libnetwork.(*controller).isManager(0xc000f32000, 0xc0061e7b00)
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]: /go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/controller.go:666 +0xea
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]: github.com/docker/docker/vendor/github.com/docker/libnetwork.(*controller).isDistributedControl(0xc000f32000, 0xc0061e7bb0)
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]: /go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/controller.go:684 +0x2d
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]: github.com/docker/docker/vendor/github.com/docker/libnetwork.(*sandbox).ResolveName(0xc0027b9b00, 0xc0090b1740, 0x9, 0x2, 0x3330333135613134, 0x8, 0x656532633>
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]: /go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/sandbox.go:557 +0x4c2
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]: github.com/docker/docker/vendor/github.com/docker/libnetwork.(*resolver).handleIPQuery(0xc00289b340, 0xc0090b1740, 0x9, 0xc00a4f4ea0, 0x2, 0x0, 0x0, 0x0)
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]: /go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/resolver.go:246 +0x74
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]: github.com/docker/docker/vendor/github.com/docker/libnetwork.(*resolver).ServeDNS(0xc00289b340, 0x564bc8bf8c80, 0xc006029920, 0xc00a4f4ea0)
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]: /go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/resolver.go:375 +0x2027
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]: github.com/docker/docker/vendor/github.com/miekg/dns.(*Server).serveDNS(0xc0061e3560, 0xc009239800, 0x1a, 0x200, 0xc006029920)
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]: /go/src/github.com/docker/docker/vendor/github.com/miekg/dns/server.go:609 +0x2e2
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]: github.com/docker/docker/vendor/github.com/miekg/dns.(*Server).serveUDPPacket(0xc0061e3560, 0xc001e08ba0, 0xc009239800, 0x1a, 0x200, 0xc0010c2450, 0xc00a0c724>
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]: /go/src/github.com/docker/docker/vendor/github.com/miekg/dns/server.go:549 +0xb4
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]: created by github.com/docker/docker/vendor/github.com/miekg/dns.(*Server).serveUDP
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]: /go/src/github.com/docker/docker/vendor/github.com/miekg/dns/server.go:479 +0x2ae
I can't think of anything else. It seems to be related to the number of running containers. Our application relies heavily on Docker Swarm's internal restart policy. We have around 100 running containers per node. We use docker-gc to clean up old containers periodically. Our containers are rather short-lived - we use Swarm's restart policy as an alternative to cron, so containers start very often. We also use quite a lot of NFS volumes in our services for data storage.
Steps to reproduce the issue:
- I wish I knew - in our case we just wait for the error to happen, restart Docker, and wait again; within 24 hours the problem is sure to arise again.
Describe the results you received:
Apart from the log already described, I can't see anything relevant in the logs. One of our networks has quite a lot of entries in its NetworkDB stats (more than 1500), but I can't see a pattern there.
We also see these log lines when restarting Docker:
Feb 16 11:40:50 docker-worker-2 dockerd[3187807]: time="2021-02-16T11:40:50.294200938+01:00" level=info msg="Processing signal 'terminated'"
Feb 16 11:41:05 docker-worker-2 dockerd[3187807]: time="2021-02-16T11:41:05.296233481+01:00" level=error msg="agent failed to clean up assignments" error="context deadline exceeded"
Feb 16 11:41:05 docker-worker-2 dockerd[3187807]: time="2021-02-16T11:41:05.297377446+01:00" level=error msg="failed to shut down cluster node: context deadline exceeded"
which seems relevant and looks like some problem with a mutex and a deadlock, but I don't know enough about Docker internals to pinpoint the problem.
Describe the results you expected:
The daemon is stable and DNS works as expected.
Additional information you deem important (e.g. issue happens only occasionally):
The issue happens quite regularly.
Output of docker version:
Client: Docker Engine - Community
Version: 20.10.3
API version: 1.41
Go version: go1.13.15
Git commit: 48d30b5
Built: Fri Jan 29 14:33:21 2021
OS/Arch: linux/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.3
API version: 1.41 (minimum version 1.12)
Go version: go1.13.15
Git commit: 46229ca
Built: Fri Jan 29 14:31:32 2021
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.4.3
GitCommit: 269548fa27e0089a8b8278fc4fc781d7f65a939b
runc:
Version: 1.0.0-rc92
GitCommit: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
docker-init:
Version: 0.19.0
GitCommit: de40ad0
Output of docker info:
Client:
Context: default
Debug Mode: false
Plugins:
app: Docker App (Docker Inc., v0.9.1-beta3)
buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)
Server:
Containers: 235
Running: 70
Paused: 0
Stopped: 165
Images: 14
Server Version: 20.10.3
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: active
NodeID: nnf06svm6en4chin0urdvo2hp
Is Manager: false
Node Address: 192.168.200.196
Manager Addresses:
192.168.200.217:2377
Runtimes: runc io.containerd.runc.v2 io.containerd.runtime.v1.linux
Default Runtime: runc
Init Binary: docker-init
containerd version: 269548fa27e0089a8b8278fc4fc781d7f65a939b
runc version: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 5.4.0-65-generic
Operating System: Ubuntu 20.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 64
Total Memory: 62.8GiB
Name: docker-worker-2
ID: DXBV:NFMD:BEIL:DL2N:5YFZ:4P34:CXJO:B63Y:6RFY:KBV5:ZSBS:WCPR
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
hub.dmsales.io:5000
192.168.200.217:5000
127.0.0.0/8
Live Restore Enabled: false
WARNING: No swap limit support
WARNING: No blkio weight support
WARNING: No blkio weight_device support
Additional environment details (AWS, VirtualBox, physical, etc.):
VM, running Ubuntu 20.04, managed by Proxmox 6.2-12