Description
Running a 3-node swarm mode setup on 1.12.3, with all 3 machines on Ubuntu 16.04 provisioned by Digital Ocean. I run multiple swarm services in both global and replicated mode, with about 5 tasks running per machine. Each service is attached to a single, manually created overlay network. Every service uses VIP mode, not dnsrr.
The services run a wide variety of software: nginx, GitLab, Redis, PostgreSQL, a Docker registry, and so on.
On random occasions, without any explicit action on any of the machines, one container suddenly becomes unreachable via the service name from all containers, including from itself: "getaddrinfo: Name or service not known". After a random amount of time the service becomes available again to all other running containers, without restarting it or doing anything on the server.
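When the outage is suspected, one quick sanity check is whether the service name still resolves through the embedded DNS inside a peer container on the same overlay network, and what VIP swarm thinks the service has. This is only a sketch; `web` and `redis` are hypothetical container and service names standing in for your own:

```shell
# Hypothetical names: "web" is a container attached to the overlay
# network, "redis" is the service that has disappeared.

# Does the swarm-embedded DNS still resolve the service name?
docker exec web getent hosts redis

# Which VIP has swarm assigned to the service on each network?
docker service inspect --format '{{json .Endpoint.VirtualIPs}}' redis
```

If `getent` fails while the VIP is still listed in the service spec, the problem is in name resolution/serving rather than in the service definition.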
Steps to reproduce the issue:
Very difficult to reproduce, since it happens at random without any change in load or configuration.
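Since the failure cannot be triggered on demand, one way to catch the window is a small polling script that logs every failed lookup with a UTC timestamp, which can later be correlated with the syslog entries below. The container and service names here are hypothetical placeholders:

```shell
# Poll service-name resolution from inside a running container every few
# seconds and log failures with a UTC timestamp for later correlation
# with syslog. CONTAINER and SERVICE are hypothetical placeholder names.
CONTAINER=web
SERVICE=redis
while true; do
  if ! docker exec "$CONTAINER" getent hosts "$SERVICE" > /dev/null 2>&1; then
    echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) lookup of $SERVICE failed from $CONTAINER"
  fi
  sleep 5
done
```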
Describe the results you received:
I found syslog entries matching the exact time of when the issue started, and when it was resolved.
The following output appeared in the syslog at the time the issue started.
On the machine where the disappearing container is running:
time="2016-11-25T13:15:48.796549527Z" level=warning msg="2016/11/25 13:15:48 [WARN] memberlist: Refuting a suspect message (from: ada-e6c6e162514a)\n"
Note that this is printed on the node named ada, so it appears to be refuting a message from itself?
On the two other machines, which are both consuming the disappeared service container:
Nov 25 13:15:43 alan dockerd[8592]: time="2016-11-25T13:15:43.606980021Z" level=info msg="2016/11/25 13:15:43 [INFO] memberlist: Suspect ada-e6c6e162514a has failed, no acks received\n"
Nov 25 13:15:48 alan dockerd[8592]: time="2016-11-25T13:15:48.608459690Z" level=info msg="2016/11/25 13:15:48 [INFO] memberlist: Marking ada-e6c6e162514a as failed, suspect timeout reached\n"
Nov 25 13:15:48 alan kernel: [341915.473234] IPVS: __ip_vs_del_service: enter
Nov 25 13:15:48 alan dockerd[8592]: time="2016-11-25T13:15:48Z" level=info msg="Firewalld running: false"
Nov 25 13:15:48 alan kernel: [341915.519195] IPVS: __ip_vs_del_service: enter
Nov 25 13:15:48 alan kernel: [341915.553478] IPVS: __ip_vs_del_service: enter
Nov 25 13:15:48 alan kernel: [341915.597754] IPVS: __ip_vs_del_service: enter
Nov 25 13:15:48 alan kernel: [341915.646862] IPVS: __ip_vs_del_service: enter
Nov 25 13:15:48 alan kernel: [341915.691013] IPVS: __ip_vs_del_service: enter
Nov 25 13:15:48 alan kernel: [341915.729924] IPVS: __ip_vs_del_service: enter
Nov 25 13:15:48 alan kernel: [341915.790222] IPVS: __ip_vs_del_service: enter
Nov 25 13:15:48 alan kernel: [341915.833920] IPVS: __ip_vs_del_service: enter
Nov 25 13:15:48 alan dockerd[8592]: message repeated 7 times: [ time="2016-11-25T13:15:48Z" level=info msg="Firewalld running: false"]
Nov 25 13:15:49 alan dockerd[8592]: time="2016-11-25T13:15:49Z" level=info msg="Firewalld running: false"
Nov 25 13:15:49 alan kernel: [341915.899899] IPVS: __ip_vs_del_service: enter
Nov 25 13:15:49 alan kernel: [341915.946886] IPVS: __ip_vs_del_service: enter
Nov 25 13:15:49 alan kernel: [341915.987816] IPVS: __ip_vs_del_service: enter
Nov 25 13:15:49 alan kernel: [341916.037331] IPVS: __ip_vs_del_service: enter
Nov 25 13:15:49 alan kernel: [341916.097550] IPVS: __ip_vs_del_service: enter
Nov 25 13:15:49 alan kernel: [341916.168667] IPVS: __ip_vs_del_service: enter
Nov 25 13:15:49 alan dockerd[8592]: message repeated 6 times: [ time="2016-11-25T13:15:49Z" level=info msg="Firewalld running: false"]
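The `IPVS: __ip_vs_del_service` lines suggest that the IPVS virtual-server entry backing the service VIP was torn down on the consuming node. If this happens again, the load-balancer state can be inspected directly inside the overlay network's sandbox namespace. This is a sketch; the netns path is typical for Docker 1.12 on Ubuntu, and the sandbox id has to be picked out of the listing by hand:

```shell
# List the network namespaces dockerd maintains for overlay sandboxes
# (path typical for Docker 1.12 on Ubuntu; adjust if yours differs).
ls /var/run/docker/netns/

# Dump the IPVS table inside one of them (requires the ipvsadm package).
# The service VIP should show up as a virtual server with one real server
# per task; if it is missing while the task is up, the LB entry was deleted.
nsenter --net=/var/run/docker/netns/<sandbox-id> ipvsadm -ln
```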
The following output appeared in the syslog at the time the issue resolved itself.
On the machine where the disappearing container is running:
No log output was found on this machine when it started working again.
On the two other machines, which are both consuming the disappeared service container:
Nov 25 13:17:39 alan dockerd[8592]: time="2016-11-25T13:17:39Z" level=info msg="Firewalld running: false"
Nov 25 13:17:39 alan dockerd[8592]: message repeated 5 times: [ time="2016-11-25T13:17:39Z" level=info msg="Firewalld running: false"]
Nov 25 13:17:40 alan dockerd[8592]: time="2016-11-25T13:17:40Z" level=info msg="Firewalld running: false"
Nov 25 13:17:40 alan dockerd[8592]: message repeated 8 times: [ time="2016-11-25T13:17:40Z" level=info msg="Firewalld running: false"]
Between these two series of entries there was no log output on any of the machines.
An important aspect is that the disappearing container itself was fully up and otherwise functioning during this whole time.
Describe the results you expected:
The container should stay resolvable via DNS throughout its lifetime.
Additional information you deem important (e.g. issue happens only occasionally):
The issue happens to every type of container and on all 3 machines, at random times and for random durations, from minutes to over an hour. It affects only one container of a service, even when the service has many replicas or runs in global mode across all machines. Since the container never actually goes down, its task is not rescheduled, which has a severe negative impact on the system.
ufw is active on all servers and configured to allow the expected ports as follows:
7946 ALLOW Anywhere
4789 ALLOW Anywhere
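For reference, swarm mode uses more than the two ports shown above: 2377/tcp for cluster management, 7946 over both TCP and UDP for the gossip/memberlist traffic that produced the Suspect/Refute lines earlier, and 4789/udp for the VXLAN overlay data path. A protocol-unqualified `ufw allow 7946` covers both, but a rule limited to one protocol could explain dropped gossip packets. A full rule set would look like:

```shell
# Ports swarm mode expects to be open between nodes (Docker 1.12 defaults):
ufw allow 2377/tcp   # cluster management (manager nodes)
ufw allow 7946/tcp   # node gossip (memberlist)
ufw allow 7946/udp   # node gossip (memberlist)
ufw allow 4789/udp   # VXLAN overlay data path
```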
Output of docker version:
Same on all machines.
Client:
Version: 1.12.3
API version: 1.24
Go version: go1.6.3
Git commit: 6b644ec
Built: Wed Oct 26 22:01:48 2016
OS/Arch: linux/amd64
Server:
Version: 1.12.3
API version: 1.24
Go version: go1.6.3
Git commit: 6b644ec
Built: Wed Oct 26 22:01:48 2016
OS/Arch: linux/amd64
Output of docker info:
Machine 1
Containers: 5
Running: 5
Paused: 0
Stopped: 0
Images: 27
Server Version: 1.12.3
Storage Driver: devicemapper
Pool Name: docker-253:1-1305602-pool
Pool Blocksize: 65.54 kB
Base Device Size: 10.74 GB
Backing Filesystem: xfs
Data file: /dev/loop0
Metadata file: /dev/loop1
Data Space Used: 5.691 GB
Data Space Total: 107.4 GB
Data Space Available: 101.7 GB
Metadata Space Used: 8.499 MB
Metadata Space Total: 2.147 GB
Metadata Space Available: 2.139 GB
Thin Pool Minimum Free Space: 10.74 GB
Udev Sync Supported: true
Deferred Removal Enabled: false
Deferred Deletion Enabled: false
Deferred Deleted Device Count: 0
Data loop file: /var/lib/docker/devicemapper/devicemapper/data
WARNING: Usage of loopback devices is strongly discouraged for production use. Use `--storage-opt dm.thinpooldev` to specify a custom block storage device.
Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
Library Version: 1.02.110 (2015-10-30)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local glusterfs
Network: host null overlay bridge
Swarm: active
NodeID: 92y5qce46lxvk0ck0zx0kn9z8
Is Manager: true
ClusterID: dcxqjr7los0jifyyu44wfwpbx
Managers: 1
Nodes: 3
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Heartbeat Tick: 1
Election Tick: 3
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Node Address: 10.132.80.107
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.4.0-47-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 12
Total Memory: 31.42 GiB
Name: alan
ID: MEKS:EJEM:6SW2:MWPY:YWII:MSN5:BMN5:M7R3:IHRE:V4MD:VA66:IGZQ
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
127.0.0.0/8
Machine 2
Containers: 13
Running: 5
Paused: 0
Stopped: 8
Images: 18
Server Version: 1.12.3
Storage Driver: devicemapper
Pool Name: docker-253:1-1182943-pool
Pool Blocksize: 65.54 kB
Base Device Size: 10.74 GB
Backing Filesystem: xfs
Data file: /dev/loop0
Metadata file: /dev/loop1
Data Space Used: 5.013 GB
Data Space Total: 107.4 GB
Data Space Available: 102.4 GB
Metadata Space Used: 8.245 MB
Metadata Space Total: 2.147 GB
Metadata Space Available: 2.139 GB
Thin Pool Minimum Free Space: 10.74 GB
Udev Sync Supported: true
Deferred Removal Enabled: false
Deferred Deletion Enabled: false
Deferred Deleted Device Count: 0
Data loop file: /var/lib/docker/devicemapper/devicemapper/data
WARNING: Usage of loopback devices is strongly discouraged for production use. Use `--storage-opt dm.thinpooldev` to specify a custom block storage device.
Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
Library Version: 1.02.110 (2015-10-30)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local glusterfs
Network: bridge null overlay host
Swarm: active
NodeID: 84bz0uxm5ok2olurrrijsnx4e
Is Manager: false
Node Address: 10.132.76.139
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.4.0-47-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 12
Total Memory: 31.42 GiB
Name: ada
ID: XVND:LYKQ:NO5T:LULW:CZFM:IIL3:GX7P:M2IK:H3RA:SGS4:BKX3:QEGY
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
127.0.0.0/8
Machine 3
Containers: 11
Running: 5
Paused: 0
Stopped: 6
Images: 16
Server Version: 1.12.3
Storage Driver: devicemapper
Pool Name: docker-253:1-1182943-pool
Pool Blocksize: 65.54 kB
Base Device Size: 10.74 GB
Backing Filesystem: xfs
Data file: /dev/loop0
Metadata file: /dev/loop1
Data Space Used: 3.682 GB
Data Space Total: 107.4 GB
Data Space Available: 103.7 GB
Metadata Space Used: 7.623 MB
Metadata Space Total: 2.147 GB
Metadata Space Available: 2.14 GB
Thin Pool Minimum Free Space: 10.74 GB
Udev Sync Supported: true
Deferred Removal Enabled: false
Deferred Deletion Enabled: false
Deferred Deleted Device Count: 0
Data loop file: /var/lib/docker/devicemapper/devicemapper/data
WARNING: Usage of loopback devices is strongly discouraged for production use. Use `--storage-opt dm.thinpooldev` to specify a custom block storage device.
Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
Library Version: 1.02.110 (2015-10-30)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: glusterfs local
Network: host bridge overlay null
Swarm: active
NodeID: 9d4kl2dfvivvyo1kjugil9lnu
Is Manager: false
Node Address: 10.132.72.183
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.4.0-47-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 12
Total Memory: 31.42 GiB
Name: marvin
ID: QMX3:X6SG:ESMN:LX5S:7QKW:LAKD:OD7E:6XOU:6J5V:2CME:7X56:MGWP
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
127.0.0.0/8
Additional environment details (AWS, VirtualBox, physical, etc.):
Digital Ocean provisioned Ubuntu 16.04.