Skip to content

Swarm is having occasional network connection problems between nodes. #32195

@darklow

Description

@darklow

Few times a day I am having connection issues between nodes and clients are seeing occasional "Bad request" error. My swarm setup (aws) has following services: nginx (global) and web (replicated=2) and separate overlay network. In nginx.conf I am using proxy_pass http://web:5000 to route requests to web service. Both services are running and marked as healthy and haven't been restarted while having these errors. Manager is separate node (30sec-manager1).

Few times a day for few requests I am receiving an errors that nginx couldn't connect upstream and I always see 10.0.0.6 IP address mentioned:

Here are related nginx and docker logs. Both web services are replicated on 30sec-worker3 and 30sec-worker4 nodes.

Nginx log:
----------
2017/03/29 07:13:18 [error] 7#7: *44944 connect() failed (113: Host is unreachable) while connecting to upstream, client: 104.154.58.95, server: 30seconds.com, request: "GET / HTTP/1.1", upstream: "http://10.0.0.6:5000/", host: "30seconds.com"

Around same time from docker logs (journalctl -u docker.service)

on node 30sec-manager1:
---------------------------
Mar 29 07:12:50 30sec-manager1 docker[30365]: time="2017-03-29T07:12:50.736935344Z" level=warning msg="memberlist: Refuting a suspect message (from: 30sec-worker3-054c94d39b58)"
Mar 29 07:12:54 30sec-manager1 docker[30365]: time="2017-03-29T07:12:54.659229055Z" level=info msg="memberlist: Marking 30sec-worker3-054c94d39b58 as failed, suspect timeout reached"
Mar 29 07:12:54 30sec-manager1 docker[30365]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-manager1 docker[30365]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-manager1 docker[30365]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-manager1 docker[30365]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-manager1 docker[30365]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-manager1 docker[30365]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:13:10 30sec-manager1 docker[30365]: time="2017-03-29T07:13:10.302960985Z" level=info msg="memberlist: Suspect 30sec-worker3-054c94d39b58 has failed, no acks received"
Mar 29 07:13:11 30sec-manager1 docker[30365]: time="2017-03-29T07:13:11.055187819Z" level=warning msg="memberlist: Refuting a suspect message (from: 30sec-worker3-054c94d39b58)"
Mar 29 07:13:14 30sec-manager1 docker[30365]: time="2017-03-29T07:13:14Z" level=info msg="Firewalld running: false"
Mar 29 07:13:14 30sec-manager1 docker[30365]: time="2017-03-29T07:13:14Z" level=info msg="Firewalld running: false"
Mar 29 07:13:14 30sec-manager1 docker[30365]: time="2017-03-29T07:13:14Z" level=info msg="Firewalld running: false"
Mar 29 07:13:14 30sec-manager1 docker[30365]: time="2017-03-29T07:13:14Z" level=info msg="Firewalld running: false"
Mar 29 07:13:14 30sec-manager1 docker[30365]: time="2017-03-29T07:13:14Z" level=info msg="Firewalld running: false"
Mar 29 07:13:14 30sec-manager1 docker[30365]: time="2017-03-29T07:13:14Z" level=info msg="Firewalld running: false"
Mar 29 07:13:14 30sec-manager1 docker[30365]: time="2017-03-29T07:13:14Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:17 30sec-manager1 docker[30365]: time="2017-03-29T07:13:17Z" level=info msg="Firewalld running: false"

on node 30sec-worker3:
-------------------------
Mar 29 07:12:50 30sec-worker3 docker[30362]: time="2017-03-29T07:12:50.613402284Z" level=info msg="memberlist: Suspect 30sec-manager1-b1cbc10665cc has failed, no acks received"
Mar 29 07:12:55 30sec-worker3 docker[30362]: time="2017-03-29T07:12:55.614174704Z" level=warning msg="memberlist: Refuting a dead message (from: 30sec-worker4-4ca6b1dcaa42)"
Mar 29 07:13:09 30sec-worker3 docker[30362]: time="2017-03-29T07:13:09.613368306Z" level=info msg="memberlist: Suspect 30sec-worker4-4ca6b1dcaa42 has failed, no acks received"
Mar 29 07:13:10 30sec-worker3 docker[30362]: time="2017-03-29T07:13:10.613972658Z" level=info msg="memberlist: Suspect 30sec-manager1-b1cbc10665cc has failed, no acks received"
Mar 29 07:13:11 30sec-worker3 docker[30362]: time="2017-03-29T07:13:11.042788976Z" level=warning msg="memberlist: Refuting a suspect message (from: 30sec-worker4-4ca6b1dcaa42)"
Mar 29 07:13:14 30sec-worker3 docker[30362]: time="2017-03-29T07:13:14.613951134Z" level=info msg="memberlist: Marking 30sec-worker4-4ca6b1dcaa42 as failed, suspect timeout reached"
Mar 29 07:13:25 30sec-worker3 docker[30362]: time="2017-03-29T07:13:25.615128313Z" level=error msg="Bulk sync to node 30sec-manager1-b1cbc10665cc timed out"

on node 30sec-worker4:
-------------------------
Mar 29 07:12:49 30sec-worker4 docker[30376]: time="2017-03-29T07:12:49.658082975Z" level=info msg="memberlist: Suspect 30sec-worker3-054c94d39b58 has failed, no acks received"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54.658737367Z" level=info msg="memberlist: Marking 30sec-worker3-054c94d39b58 as failed, suspect timeout reached"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:13:09 30sec-worker4 docker[30376]: time="2017-03-29T07:13:09.658056735Z" level=info msg="memberlist: Suspect 30sec-worker3-054c94d39b58 has failed, no acks received"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16.303689665Z" level=warning msg="memberlist: Refuting a suspect message (from: 30sec-worker4-4ca6b1dcaa42)"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"

syslog on 30sec-worker4:
--------------------------
Mar 29 07:12:49 30sec-worker4 docker[30376]: time="2017-03-29T07:12:49.658082975Z" level=info msg="memberlist: Suspect 30sec-worker3-054c94d39b58 has failed, no acks received"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54.658737367Z" level=info msg="memberlist: Marking 30sec-worker3-054c94d39b58 as failed, suspect timeout reached"
Mar 29 07:12:54 30sec-worker4 kernel: [645679.048975] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 kernel: [645679.100691] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 kernel: [645679.130069] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 kernel: [645679.155859] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 kernel: [645679.180461] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 kernel: [645679.205707] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 kernel: [645679.230326] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 kernel: [645679.255597] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 docker[30376]: message repeated 7 times: [ time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"]
Mar 29 07:13:09 30sec-worker4 docker[30376]: time="2017-03-29T07:13:09.658056735Z" level=info msg="memberlist: Suspect 30sec-worker3-054c94d39b58 has failed, no acks received"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16.303689665Z" level=warning msg="memberlist: Refuting a suspect message (from: 30sec-worker4-4ca6b1dcaa42)"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-worker4 docker[30376]: message repeated 7 times: [ time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"]

I checked other cases when nginx can't find find upstream and always I find these 3x lines appear most at these times in docker logs in:

level=info msg="memberlist:Suspect 30sec-worker3-054c94d39b58 has failed, no acks received"
level=warning msg="memberlist: Refuting a suspect message (from: 30sec-worker3-054c94d39b58)"
level=warning msg="memberlist: Refuting a dead message (from: 30sec-worker3-054c94d39b58)

By searching other issues, found that these have similar errors, so it may be related:
#28843
#25325

Anything I should check or debug more to spot the problem or is it a bug?
Thank you.

Output of docker version:

Client:
 Version:      17.03.0-ce
 API version:  1.26
 Go version:   go1.7.5
 Git commit:   60ccb22
 Built:        Thu Feb 23 11:02:43 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.0-ce
 API version:  1.26 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   60ccb22
 Built:        Thu Feb 23 11:02:43 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

Containers: 18
 Running: 3
 Paused: 0
 Stopped: 15
Images: 16
Server Version: 17.03.0-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 83
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: active
 NodeID: ck99cyhgydt8y1zn8ik2xmcdv
 Is Manager: true
 ClusterID: in0q54eh74ljazrprt0vza3wj
 Managers: 1
 Nodes: 5
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 172.31.31.146
 Manager Addresses:
  172.31.31.146:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 977c511eda0925a723debdc94d09459af49d082a
runc version: a01dafd48bc1c7cc12bdb01206f9fea7dd6feb70
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-57-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 990.6 MiB
Name: 30sec-manager1
ID: 5IIF:RONB:Y27Q:5MKX:ENEE:HZWM:XYBV:O6KN:BKL6:AEUK:2VKB:MO5P
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Labels:
 provider=amazonec2
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):
Amazon AWS (Manager - t2.micro, rest of nodes - t2.small)

docker-compose.yml (There are more services and nodes in setup, but I posted only involved ones)

version: "3"

services:

  nginx:
    image: 333435094895.dkr.ecr.us-east-1.amazonaws.com/swarm/nginx:latest
    ports:
      - 80:80
      - 81:81
    networks:
      - thirtysec
    depends_on:
      - web
    deploy:
      mode: global
      update_config:
        delay: 2s
        monitor: 2s

  web:
    image: 333435094895.dkr.ecr.us-east-1.amazonaws.com/swarm/os:latest
    command: sh -c "python manage.py collectstatic --noinput && daphne thirtysec.asgi:channel_layer -b 0.0.0.0 -p 5000"
    ports:
      - 5000:5000
    networks:
      - thirtysec
    deploy:
      mode: replicated
      replicas: 2
      labels: [APP=THIRTYSEC]
      update_config:
        delay: 15s
        monitor: 15s
      placement:
        constraints: [node.labels.aws_type == t2.small]

    healthcheck:
      test: goss -g deploy/swarm/checks/web-goss.yaml validate
      interval: 2s
      timeout: 3s
      retries: 15

networks:
    thirtysec:

web-goss.yaml

port:
  tcp:5000:
    listening: true
    ip:
    - 0.0.0.0

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions