
Docker daemon gets stuck, can't serve DNS requests  #42032

@nrafal

Description

For some time (I believe since updating from 19.03, but I'm not sure) we have been experiencing strange Docker Swarm behavior. We run a small Docker Swarm cluster: two workers, one manager. The nodes are VMs running Ubuntu 20.04, managed by Proxmox.

We can see that sometimes (I can't tell why or exactly when) the Docker daemon becomes unresponsive. We experience it as a problem with DNS requests: all containers on that node keep working, they just can't query Docker's internal DNS. When this happens, I can't drain the node from the manager - it is marked as Drain, but the containers do not stop. I can't stop containers with docker stop either. The only fix is to restart Docker with systemctl; after that everything works for a while.

While restarting Docker I can see a very long Go stack dump; the most frequently repeated part is:

Feb 16 11:41:06 docker-worker-2 dockerd[3187807]: goroutine 21208128 [semacquire, 7 minutes]:
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]: sync.runtime_SemacquireMutex(0xc000f320fc, 0xc0027b9c00, 0x1)
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]:         /usr/local/go/src/runtime/sema.go:71 +0x49
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]: sync.(*Mutex).lockSlow(0xc000f320f8)
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]:         /usr/local/go/src/sync/mutex.go:138 +0xfe
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]: sync.(*Mutex).Lock(...)
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]:         /usr/local/go/src/sync/mutex.go:81
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]: github.com/docker/docker/vendor/github.com/docker/libnetwork.(*controller).isManager(0xc000f32000, 0xc0061e7b00)
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]:         /go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/controller.go:666 +0xea
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]: github.com/docker/docker/vendor/github.com/docker/libnetwork.(*controller).isDistributedControl(0xc000f32000, 0xc0061e7bb0)
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]:         /go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/controller.go:684 +0x2d
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]: github.com/docker/docker/vendor/github.com/docker/libnetwork.(*sandbox).ResolveName(0xc0027b9b00, 0xc0090b1740, 0x9, 0x2, 0x3330333135613134, 0x8, 0x656532633>
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]:         /go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/sandbox.go:557 +0x4c2
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]: github.com/docker/docker/vendor/github.com/docker/libnetwork.(*resolver).handleIPQuery(0xc00289b340, 0xc0090b1740, 0x9, 0xc00a4f4ea0, 0x2, 0x0, 0x0, 0x0)
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]:         /go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/resolver.go:246 +0x74
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]: github.com/docker/docker/vendor/github.com/docker/libnetwork.(*resolver).ServeDNS(0xc00289b340, 0x564bc8bf8c80, 0xc006029920, 0xc00a4f4ea0)
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]:         /go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/resolver.go:375 +0x2027
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]: github.com/docker/docker/vendor/github.com/miekg/dns.(*Server).serveDNS(0xc0061e3560, 0xc009239800, 0x1a, 0x200, 0xc006029920)
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]:         /go/src/github.com/docker/docker/vendor/github.com/miekg/dns/server.go:609 +0x2e2
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]: github.com/docker/docker/vendor/github.com/miekg/dns.(*Server).serveUDPPacket(0xc0061e3560, 0xc001e08ba0, 0xc009239800, 0x1a, 0x200, 0xc0010c2450, 0xc00a0c724>
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]:         /go/src/github.com/docker/docker/vendor/github.com/miekg/dns/server.go:549 +0xb4
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]: created by github.com/docker/docker/vendor/github.com/miekg/dns.(*Server).serveUDP
Feb 16 11:41:06 docker-worker-2 dockerd[3187807]:         /go/src/github.com/docker/docker/vendor/github.com/miekg/dns/server.go:479 +0x2ae

Can't think of anything else. It seems to be related to the number of containers running. Our application relies heavily on Docker Swarm's internal restart policy. We have around 100 running containers per node. We use docker-gc to clean up old containers periodically. Our containers are rather short-lived - we use Swarm's restart policy as an alternative to cron, so containers start very often. We also use quite a lot of NFS volumes in our services for data storage.

Steps to reproduce the issue:

  1. I wish I knew - in our case we just need to wait for the error to happen, restart Docker, wait again, and within 24 hours the problem will surely arise again.

Describe the results you received:
Apart from the log already described, I can't see anything relevant in the logs. One of our networks has quite a lot of entries in the NetworkDB stats (more than 1500), but I can't see a pattern there.

We also see these logs when restarting Docker:

Feb 16 11:40:50 docker-worker-2 dockerd[3187807]: time="2021-02-16T11:40:50.294200938+01:00" level=info msg="Processing signal 'terminated'"
Feb 16 11:41:05 docker-worker-2 dockerd[3187807]: time="2021-02-16T11:41:05.296233481+01:00" level=error msg="agent failed to clean up assignments" error="context deadline exceeded"
Feb 16 11:41:05 docker-worker-2 dockerd[3187807]: time="2021-02-16T11:41:05.297377446+01:00" level=error msg="failed to shut down cluster node: context deadline exceeded"

which seems relevant and looks like some problem with a mutex and a deadlock, but I don't know enough about Docker internals to pinpoint the problem.

Describe the results you expected:
The daemon is stable and DNS works as expected.

Additional information you deem important (e.g. issue happens only occasionally):
Issue happens quite regularly.

Output of docker version:

Client: Docker Engine - Community
 Version:           20.10.3
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        48d30b5
 Built:             Fri Jan 29 14:33:21 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.3
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       46229ca
  Built:            Fri Jan 29 14:31:32 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.3
  GitCommit:        269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc:
  Version:          1.0.0-rc92
  GitCommit:        ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Output of docker info:

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)

Server:
 Containers: 235
  Running: 70
  Paused: 0
  Stopped: 165
 Images: 14
 Server Version: 20.10.3
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: active
  NodeID: nnf06svm6en4chin0urdvo2hp
  Is Manager: false
  Node Address: 192.168.200.196
  Manager Addresses:
   192.168.200.217:2377
 Runtimes: runc io.containerd.runc.v2 io.containerd.runtime.v1.linux
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc version: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.4.0-65-generic
 Operating System: Ubuntu 20.04.2 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 64
 Total Memory: 62.8GiB
 Name: docker-worker-2
 ID: DXBV:NFMD:BEIL:DL2N:5YFZ:4P34:CXJO:B63Y:6RFY:KBV5:ZSBS:WCPR
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  hub.dmsales.io:5000
  192.168.200.217:5000
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No swap limit support
WARNING: No blkio weight support
WARNING: No blkio weight_device support

Additional environment details (AWS, VirtualBox, physical, etc.):
VM, running Ubuntu 20.04, managed by Proxmox 6.2-12

Metadata


    Labels

    area/networking, kind/bug, version/20.10
