Overlay networking not working after leaving and rejoining a swarm #30314

@Bilathon

Description

I create a simple swarm consisting of one overlay network (test-network) and three nodes: manager, worker1, worker2. I start a busybox container on both worker nodes. They can find each other without a problem.

Then one node is removed from the swarm using 'docker swarm leave' and rejoins it later. After the rejoin, the restarted busybox container sometimes has connectivity to the rest of the overlay network and sometimes it does not.

Also, when I restart a node ('docker-machine restart worker1'), the restarted busybox container sometimes has a connection to the overlay network and sometimes it does not.

When it does not, sometimes it helps to leave the swarm and rejoin again.

Steps to reproduce the issue:
Run this script:

#!/bin/sh
docker-machine create --driver virtualbox --virtualbox-hostonly-cidr '25.0.1.100/24' manager
docker-machine create --driver virtualbox --virtualbox-hostonly-cidr '25.0.1.100/24' worker1 
docker-machine create --driver virtualbox --virtualbox-hostonly-cidr '25.0.1.100/24' worker2 

IP=`docker-machine ip manager`
docker-machine ssh manager "docker swarm init --advertise-addr $IP"
docker-machine ssh manager "docker network create --driver overlay --opt encrypted --attachable test-network"

TOKEN=`docker-machine ssh manager "docker swarm join-token -q worker"`
docker-machine ssh worker1 "docker swarm join --token $TOKEN $IP:2377"
docker-machine ssh worker2 "docker swarm join --token $TOKEN $IP:2377"

docker-machine ssh manager "docker service create --replicas 1 --constraint 'node.hostname==worker1' --name bb1 --network test-network busybox sleep 3000"
docker-machine ssh manager "docker service create --replicas 1 --constraint 'node.hostname==worker2' --name bb2 --network test-network busybox sleep 3000"
# Give manager time to start the service
sleep 30

echo "System setup complete, trying nslookup of bb2 in bb1:"
NAME=`docker-machine ssh worker1 "docker ps --format '{{.Names}}'"`
docker-machine ssh worker1 "docker exec $NAME nslookup bb2"

docker-machine ssh worker1 "docker swarm leave"
docker-machine ssh worker1 "docker swarm join --token $TOKEN $IP:2377"
# Give manager time to restart the service
sleep 30

# Perform nslookup after rejoin
NAME=`docker-machine ssh worker1 "docker ps --format '{{.Names}}'"`
docker-machine ssh worker1 "docker exec $NAME nslookup bb2"

docker-machine restart worker1
# After the restart worker1 still thinks it is in a swarm, but manager
# does not restart the service until we rejoin
docker-machine ssh worker1 "docker swarm leave"
docker-machine ssh worker1 "docker swarm join --token $TOKEN $IP:2377"
# Give manager time to restart the service
sleep 30

echo "Trying nslookup after reboot:"
NAME=`docker-machine ssh worker1 "docker ps --format '{{.Names}}'"`
docker-machine ssh worker1 "docker exec $NAME nslookup bb2"

echo "Leaving and rejoining swarm"
docker-machine ssh worker1 "docker swarm leave"
docker-machine ssh worker1 "docker swarm join --token $TOKEN $IP:2377"
# Give manager time to restart the service
sleep 30

echo "Retrying nslookup after rejoin:"
NAME=`docker-machine ssh worker1 "docker ps --format '{{.Names}}'"`
docker-machine ssh worker1 "docker exec $NAME nslookup bb2"

# To clean up after this test:
# docker-machine rm manager worker1 worker2

Describe the results you received:
The first nslookup always succeeds, the others fail more often than not. I have run this script dozens of times and most of the time one of the other three nslookup executions succeeds and the other two fail.
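
One way to check that these failures are not just an artifact of the fixed 30-second sleeps is to poll the lookup instead of running it once. The following is a minimal sketch, not part of the original script: `retry_until` is a hypothetical helper name, and the retry count and delay are arbitrary assumptions.

```shell
#!/bin/sh
# retry_until MAX_TRIES DELAY_SECONDS COMMAND...
# Hypothetical helper (not part of the original script): runs COMMAND until
# it succeeds or MAX_TRIES attempts have been made.
# Returns 0 on the first success, 1 if every attempt failed.
retry_until() {
    max_tries=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_tries" ]; do
        if "$@"; then
            echo "succeeded on attempt $attempt"
            return 0
        fi
        # Wait between attempts, but not after the final one.
        if [ "$attempt" -lt "$max_tries" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    echo "failed after $max_tries attempts"
    return 1
}

# Assumed usage, reusing $NAME from the script above (12 tries, 5 s apart):
# retry_until 12 5 docker-machine ssh worker1 "docker exec $NAME nslookup bb2"
```

If the lookup still fails after a generous number of attempts, the problem is unlikely to be a matter of the service simply not having been rescheduled yet.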

Describe the results you expected:
All nslookup executions should succeed in finding the other service.

Additional information you deem important (e.g. issue happens only occasionally):

Output of docker version:

docker@manager:~$ docker version
Client:
 Version:      1.13.0
 API version:  1.25
 Go version:   go1.7.3
 Git commit:   49bf474
 Built:        Wed Jan 18 16:20:26 2017
 OS/Arch:      linux/amd64

Server:
 Version:      1.13.0
 API version:  1.25 (minimum version 1.12)
 Go version:   go1.7.3
 Git commit:   49bf474
 Built:        Wed Jan 18 16:20:26 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

docker@manager:~$ docker info
Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 0
Server Version: 1.13.0
Storage Driver: aufs
 Root Dir: /mnt/sda1/var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 0
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: active
 NodeID: 5mjy2yknmuwr3xsy8jmjnxdcb
 Is Manager: true
 ClusterID: 84b96auhfjkwxch1nlcanp55z
 Managers: 1
 Nodes: 5
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 25.0.1.136
 Manager Addresses:
  25.0.1.136:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 03e5862ec0d8d3b3f750e19fca3ee367e13c090e
runc version: 2f7393a47307a16f8cee44a37b262e8b81021e3e
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.4.43-boot2docker
Operating System: Boot2Docker 1.13.0 (TCL 7.2); HEAD : 5b8d9cb - Wed Jan 18 18:50:40 UTC 2017
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 995.8 MiB
Name: manager
ID: 4CCI:YIOM:NFWM:45JD:HIGQ:2NKJ:EIW7:MOM5:2X7E:KPXL:XHDV:YVBW
Docker Root Dir: /mnt/sda1/var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 32
 Goroutines: 144
 System Time: 2017-01-20T13:13:55.937666998Z
 EventsListeners: 0
Registry: https://index.docker.io/v1/
Labels:
 provider=virtualbox
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):
The host machine is running Ubuntu 16.04; the output of 'uname -a' is:

Linux fantan 4.4.0-59-generic #80-Ubuntu SMP Fri Jan 6 17:47:47 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Below is a typical failure output. In this case nslookup 1 and 3 succeed, while 2 and 4 fail. I have other traces where 1 and 4 succeed but 2 and 3 fail.

Running pre-create checks...
Creating machine...
(manager) Copying /home/pieter/.docker/machine/cache/boot2docker.iso to /home/pieter/.docker/machine/machines/manager/boot2docker.iso...
(manager) Creating VirtualBox VM...
(manager) Creating SSH key...
(manager) Starting the VM...
(manager) Check network to re-create if needed...
(manager) Waiting for an IP...
Waiting for machine to be running, this may take a few minutes...
Detecting operating system of created instance...
Waiting for SSH to be available...
Detecting the provisioner...
Provisioning with boot2docker...
Copying certs to the local machine directory...
Copying certs to the remote machine...
Setting Docker configuration on the remote daemon...
Checking connection to Docker...
Docker is up and running!
To see how to connect your Docker Client to the Docker Engine running on this virtual machine, run: docker-machine env manager
Running pre-create checks...
Creating machine...
(worker1) Copying /home/pieter/.docker/machine/cache/boot2docker.iso to /home/pieter/.docker/machine/machines/worker1/boot2docker.iso...
(worker1) Creating VirtualBox VM...
(worker1) Creating SSH key...
(worker1) Starting the VM...
(worker1) Check network to re-create if needed...
(worker1) Waiting for an IP...
Waiting for machine to be running, this may take a few minutes...
Detecting operating system of created instance...
Waiting for SSH to be available...
Detecting the provisioner...
Provisioning with boot2docker...
Copying certs to the local machine directory...
Copying certs to the remote machine...
Setting Docker configuration on the remote daemon...
Checking connection to Docker...
Docker is up and running!
To see how to connect your Docker Client to the Docker Engine running on this virtual machine, run: docker-machine env worker1
Running pre-create checks...
Creating machine...
(worker2) Copying /home/pieter/.docker/machine/cache/boot2docker.iso to /home/pieter/.docker/machine/machines/worker2/boot2docker.iso...
(worker2) Creating VirtualBox VM...
(worker2) Creating SSH key...
(worker2) Starting the VM...
(worker2) Check network to re-create if needed...
(worker2) Waiting for an IP...
Waiting for machine to be running, this may take a few minutes...
Detecting operating system of created instance...
Waiting for SSH to be available...
Detecting the provisioner...
Provisioning with boot2docker...
Copying certs to the local machine directory...
Copying certs to the remote machine...
Setting Docker configuration on the remote daemon...
Checking connection to Docker...
Docker is up and running!
To see how to connect your Docker Client to the Docker Engine running on this virtual machine, run: docker-machine env worker2
Swarm initialized: current node (c8qovx9scug0a6yf9ktt17yz7) is now a manager.

To add a worker to this swarm, run the following command:

    docker swarm join \
    --token SWMTKN-1-2u48gafz6i7cnst9t6q11n9cvat9arbdmd30ypv6loejqyxrm0-4ijnpvsrll9iq49hauen8m8u7 \
    25.0.1.118:2377

To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.

sou6jkxt9id4fvgcty841hgv1
This node joined a swarm as a worker.
This node joined a swarm as a worker.
s9w03elxjb1bi4vaipqa3zib7
q3ang10jayg5l0veyczu04bna
System setup complete, trying nslookup of bb2 in bb1:
Server:    127.0.0.11
Address 1: 127.0.0.11

Name:      bb2
Address 1: 10.0.0.4
Node left the swarm.
This node joined a swarm as a worker.
Server:    127.0.0.11
Address 1: 127.0.0.11

nslookup: can't resolve 'bb2'
exit status 1
Restarting "worker1"...
(worker1) Check network to re-create if needed...
(worker1) Waiting for an IP...
Waiting for SSH to be available...
Detecting the provisioner...
Restarted machines may have new IP addresses. You may need to re-run the `docker-machine env` command.
Error response from daemon: context deadline exceeded
exit status 1
This node joined a swarm as a worker.
Trying nslookup after reboot:
Server:    127.0.0.11
Address 1: 127.0.0.11

Name:      bb2
Address 1: 10.0.0.4
Leaving and rejoining swarm
Node left the swarm.
This node joined a swarm as a worker.
Retrying nslookup after rejoin:
Server:    127.0.0.11
Address 1: 127.0.0.11

nslookup: can't resolve 'bb2'
exit status 1
