
Docker swarm-mode 1.12.3 overlay network randomly dropping a random container from DNS #28843

@jgranstrom

Description

Running a 3-node swarm-mode setup on 1.12.3, with all machines on Ubuntu 16.04 provisioned by DigitalOcean. I run multiple swarm services in both global and replicated mode, with about 5 tasks running per machine. Each service is attached to a single, manually created overlay network. Every service uses VIP endpoint mode, not dnsrr.
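For context, the topology is roughly reproducible as follows (network, service, and image names here are illustrative, not the actual services in use):

```shell
# Shared overlay network, created manually (name is illustrative)
docker network create --driver overlay appnet

# Replicated service attached to the overlay; endpoint mode defaults to VIP
docker service create --name cache --network appnet --replicas 3 redis:3

# Global service: one task per node
docker service create --name web --network appnet --mode global nginx:1.11
```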

The services are a wide variety of software like nginx, gitlab, redis, postgres, docker registry and so on.

At random times, without any explicit action on any of the machines, one container suddenly becomes unresolvable by service name from all containers, including from itself ("getaddrinfo: Name or service not known"). After a random amount of time the service becomes available again to all other running containers, without restarting it or doing anything on the server.
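When the issue strikes, resolution can be checked directly from a consuming container; a sketch (the container and service names are illustrative):

```shell
# Resolve the service VIP by service name from inside a consuming container
docker exec -it <container> nslookup cache

# tasks.<service> (if supported by this engine version) bypasses the VIP
# and lists individual task IPs, which helps distinguish a VIP/IPVS problem
# from a missing DNS record
docker exec -it <container> nslookup tasks.cache
```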

Steps to reproduce the issue:
Very difficult to reproduce, since it happens at random times without any change in load or configuration.

Describe the results you received:
I found syslog entries matching the exact times when the issue started and when it resolved.

The following output was found in the syslog at the time the issue started.

The machine where the disappearing container is running:

time="2016-11-25T13:15:48.796549527Z" level=warning msg="2016/11/25 13:15:48 [WARN] memberlist: Refuting a suspect message (from: ada-e6c6e162514a)\n"

Note that this is printed on the node named ada, so it seems to be refuting a suspect message about itself?

The two other machines, which both consume the disappeared service's container:

Nov 25 13:15:43 alan dockerd[8592]: time="2016-11-25T13:15:43.606980021Z" level=info msg="2016/11/25 13:15:43 [INFO] memberlist: Suspect ada-e6c6e162514a has failed, no acks received\n"
Nov 25 13:15:48 alan dockerd[8592]: time="2016-11-25T13:15:48.608459690Z" level=info msg="2016/11/25 13:15:48 [INFO] memberlist: Marking ada-e6c6e162514a as failed, suspect timeout reached\n"
Nov 25 13:15:48 alan kernel: [341915.473234] IPVS: __ip_vs_del_service: enter
Nov 25 13:15:48 alan dockerd[8592]: time="2016-11-25T13:15:48Z" level=info msg="Firewalld running: false"
Nov 25 13:15:48 alan kernel: [341915.519195] IPVS: __ip_vs_del_service: enter
Nov 25 13:15:48 alan kernel: [341915.553478] IPVS: __ip_vs_del_service: enter
Nov 25 13:15:48 alan kernel: [341915.597754] IPVS: __ip_vs_del_service: enter
Nov 25 13:15:48 alan kernel: [341915.646862] IPVS: __ip_vs_del_service: enter
Nov 25 13:15:48 alan kernel: [341915.691013] IPVS: __ip_vs_del_service: enter
Nov 25 13:15:48 alan kernel: [341915.729924] IPVS: __ip_vs_del_service: enter
Nov 25 13:15:48 alan kernel: [341915.790222] IPVS: __ip_vs_del_service: enter
Nov 25 13:15:48 alan kernel: [341915.833920] IPVS: __ip_vs_del_service: enter
Nov 25 13:15:48 alan dockerd[8592]: message repeated 7 times: [ time="2016-11-25T13:15:48Z" level=info msg="Firewalld running: false"]
Nov 25 13:15:49 alan dockerd[8592]: time="2016-11-25T13:15:49Z" level=info msg="Firewalld running: false"
Nov 25 13:15:49 alan kernel: [341915.899899] IPVS: __ip_vs_del_service: enter
Nov 25 13:15:49 alan kernel: [341915.946886] IPVS: __ip_vs_del_service: enter
Nov 25 13:15:49 alan kernel: [341915.987816] IPVS: __ip_vs_del_service: enter
Nov 25 13:15:49 alan kernel: [341916.037331] IPVS: __ip_vs_del_service: enter
Nov 25 13:15:49 alan kernel: [341916.097550] IPVS: __ip_vs_del_service: enter
Nov 25 13:15:49 alan kernel: [341916.168667] IPVS: __ip_vs_del_service: enter
Nov 25 13:15:49 alan dockerd[8592]: message repeated 6 times: [ time="2016-11-25T13:15:49Z" level=info msg="Firewalld running: false"]

The following output was found in the syslog at the time the issue was resolved.

The machine where the disappearing container is running:
No log output was found on this machine when it started working again.

The two other machines, which both consume the disappeared service's container:

Nov 25 13:17:39 alan dockerd[8592]: time="2016-11-25T13:17:39Z" level=info msg="Firewalld running: false"
Nov 25 13:17:39 alan dockerd[8592]: message repeated 5 times: [ time="2016-11-25T13:17:39Z" level=info msg="Firewalld running: false"]
Nov 25 13:17:40 alan dockerd[8592]: time="2016-11-25T13:17:40Z" level=info msg="Firewalld running: false"
Nov 25 13:17:40 alan dockerd[8592]: message repeated 8 times: [ time="2016-11-25T13:17:40Z" level=info msg="Firewalld running: false"]

Between these two series of entries there was no log output on any of the machines.

Importantly, the disappearing container was fully up and otherwise functioning during this whole time.

Describe the results you expected:
The container should stay available via DNS throughout its lifetime.

Additional information you deem important (e.g. issue happens only occasionally):
The issue affects every type of container and all 3 machines, at random times and for a random duration, from minutes to over an hour. It affects only one container of a service at a time, even if the service has many replicas or runs in global mode across all machines. Since the container is not actually going down, the task is not rescheduled, which has a severe negative impact on the system.

ufw is active and configured on all servers to allow the expected ports:

7946                       ALLOW       Anywhere
4789                       ALLOW       Anywhere
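For completeness, these are the ports swarm mode uses between nodes (note that a `ufw allow PORT` rule without a protocol, as above, covers both TCP and UDP):

```shell
# Ports swarm mode needs open between nodes (per Docker's swarm
# networking documentation)
ufw allow 2377/tcp   # cluster management (workers -> manager)
ufw allow 7946/tcp   # container network discovery (memberlist gossip)
ufw allow 7946/udp
ufw allow 4789/udp   # VXLAN overlay network data plane
```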

Output of docker version:

Same on all machines.

Client:
 Version:      1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   6b644ec
 Built:        Wed Oct 26 22:01:48 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   6b644ec
 Built:        Wed Oct 26 22:01:48 2016
 OS/Arch:      linux/amd64

Output of docker info:

Machine 1

Containers: 5
 Running: 5
 Paused: 0
 Stopped: 0
Images: 27
Server Version: 1.12.3
Storage Driver: devicemapper
 Pool Name: docker-253:1-1305602-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 5.691 GB
 Data Space Total: 107.4 GB
 Data Space Available: 101.7 GB
 Metadata Space Used: 8.499 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.139 GB
 Thin Pool Minimum Free Space: 10.74 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 WARNING: Usage of loopback devices is strongly discouraged for production use. Use `--storage-opt dm.thinpooldev` to specify a custom block storage device.
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.110 (2015-10-30)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local glusterfs
 Network: host null overlay bridge
Swarm: active
 NodeID: 92y5qce46lxvk0ck0zx0kn9z8
 Is Manager: true
 ClusterID: dcxqjr7los0jifyyu44wfwpbx
 Managers: 1
 Nodes: 3
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 10.132.80.107
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.4.0-47-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 12
Total Memory: 31.42 GiB
Name: alan
ID: MEKS:EJEM:6SW2:MWPY:YWII:MSN5:BMN5:M7R3:IHRE:V4MD:VA66:IGZQ
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
 127.0.0.0/8

Machine 2

Containers: 13
 Running: 5
 Paused: 0
 Stopped: 8
Images: 18
Server Version: 1.12.3
Storage Driver: devicemapper
 Pool Name: docker-253:1-1182943-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 5.013 GB
 Data Space Total: 107.4 GB
 Data Space Available: 102.4 GB
 Metadata Space Used: 8.245 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.139 GB
 Thin Pool Minimum Free Space: 10.74 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 WARNING: Usage of loopback devices is strongly discouraged for production use. Use `--storage-opt dm.thinpooldev` to specify a custom block storage device.
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.110 (2015-10-30)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local glusterfs
 Network: bridge null overlay host
Swarm: active
 NodeID: 84bz0uxm5ok2olurrrijsnx4e
 Is Manager: false
 Node Address: 10.132.76.139
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.4.0-47-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 12
Total Memory: 31.42 GiB
Name: ada
ID: XVND:LYKQ:NO5T:LULW:CZFM:IIL3:GX7P:M2IK:H3RA:SGS4:BKX3:QEGY
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
 127.0.0.0/8

Machine 3

Containers: 11
 Running: 5
 Paused: 0
 Stopped: 6
Images: 16
Server Version: 1.12.3
Storage Driver: devicemapper
 Pool Name: docker-253:1-1182943-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 3.682 GB
 Data Space Total: 107.4 GB
 Data Space Available: 103.7 GB
 Metadata Space Used: 7.623 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.14 GB
 Thin Pool Minimum Free Space: 10.74 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 WARNING: Usage of loopback devices is strongly discouraged for production use. Use `--storage-opt dm.thinpooldev` to specify a custom block storage device.
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.110 (2015-10-30)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: glusterfs local
 Network: host bridge overlay null
Swarm: active
 NodeID: 9d4kl2dfvivvyo1kjugil9lnu
 Is Manager: false
 Node Address: 10.132.72.183
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.4.0-47-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 12
Total Memory: 31.42 GiB
Name: marvin
ID: QMX3:X6SG:ESMN:LX5S:7QKW:LAKD:OD7E:6XOU:6J5V:2CME:7X56:MGWP
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
 127.0.0.0/8

Additional environment details (AWS, VirtualBox, physical, etc.):
DigitalOcean-provisioned Ubuntu 16.04.
