17.07-ce Docker Swarm - Can't join another manager or promote a worker (initialize raft node: rpc error) #35046

@svscorp

Description

When I try to add a node as a second manager in the swarm:

docker swarm join --listen-addr THIS_NODE_IP:2377 --advertise-addr THIS_NODE_IP --token "MANAGER_TOKEN" MANAGER-1_NODE_IP:2377

or promote an existing worker:

(on a worker "master-2") docker swarm join --listen-addr THIS_NODE_IP:2377 --advertise-addr THIS_NODE_IP --token "WORKER_TOKEN" MANAGER-1_NODE_IP:2377
>>> node has successfully joined the swarm

(on swarm manager & leader) docker node promote master-2

I get the error "can't initialize raft node: rpc error: code = Unknown" (full message under Actual Result), and the node is shown as DOWN on the swarm leader.

Setup:

Docker 17.07.0-ce (Swarm mode), RHEL 7.3, kernel 4.13.3 (same behavior on 3.10), behind a proxy.
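
Because every node sits behind a proxy, one thing that may be worth ruling out is whether each node's address is actually exempted in the daemon's No Proxy list. Whether the proxy settings affect the manager-to-manager connection in this Docker version is an assumption, not something confirmed in this report. The helper below is a hypothetical sketch (the name in_no_proxy is made up for illustration) that only does exact matching on a comma-separated list; it does not handle wildcards, domain suffixes, or CIDR ranges.

```shell
#!/usr/bin/env bash
# in_no_proxy HOST LIST
# Succeeds (exit 0) if HOST is an exact entry of the comma-separated LIST.
# Hypothetical helper for illustration only: no wildcard, suffix, or CIDR matching.
in_no_proxy() {
  local host="$1"
  local list
  # Drop spaces so "a, b" and "a,b" are treated alike.
  list="$(printf '%s' "$2" | tr -d ' ')"
  # Wrap the list in commas so every entry is delimited on both sides.
  case ",${list}," in
    *",${host},"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

To use it, paste the "No Proxy" value reported by docker info as the second argument, for example: in_no_proxy MASTER_2_IP "NO_PROXY_VALUE" && echo exempt || echo "not exempt". Both arguments here are placeholders.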

docker info output for each node:

docker info (swarm master-1)
root@master-1 # docker info
Containers: 3
 Running: 3
 Paused: 0
 Stopped: 0
Images: 4
Server Version: 17.07.0-ce
Storage Driver: overlay
 Backing Filesystem: extfs
 Supports d_type: true
Logging Driver: journald
Cgroup Driver: cgroupfs
Plugins:
 Volume: local (*using volume driver=local, type=nfs*)
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: aw5r20gz2pr18m9yglk4qwjh4
 Is Manager: true
 ClusterID: b5gds92q3otj9ego2en95qm34
 Managers: 1
 Nodes: 1
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Root Rotation In Progress: false
 Node Address: <manager-1-ip>
 Manager Addresses:
<manager-1-ip>:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 3addd840653146c90a254301d6c3a663c7fd6429
runc version: 2d41c047c83e09a6d61d464906feb2a2f3c52aa4
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.13.3-1.el7.elrepo.x86_64
Operating System: Red Hat Enterprise Linux Server 7.3 (Maipo)
OSType: linux
Architecture: x86_64
Name: master-1
ID: DMJT:S6EF:LUAF:PQKX:I5R6:TU2W:JZRH:MVMM:YJC5:MC2K:SGSQ:PXKM
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Http Proxy: http://<proxy>
Https Proxy: http://<proxy>
No Proxy: <nodes-ips>,<master-1-ip>,<master-2-ip>,<proxy-ip>,127.0.0.1,localhost,sonar,jenkins
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
docker info (swarm master-2)
root@master-2 # docker info
Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 0
Server Version: 17.07.0-ce
Storage Driver: overlay
 Backing Filesystem: xfs
 Supports d_type: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 3addd840653146c90a254301d6c3a663c7fd6429
runc version: 2d41c047c83e09a6d61d464906feb2a2f3c52aa4
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-514.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.3 (Maipo)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 8GiB
Name: master-2
ID: EY7E:5F7I:OEDA:PBXW:ETM2:IQZU:USYK:6UQM:NGVT:HEFU:TJ3Q:QAUP
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Http Proxy: http://<proxy>
Https Proxy: http://<proxy>
No Proxy: <nodes-ips>,<master-1-ip>,<master-2-ip>,<proxy-ip>,127.0.0.1,localhost,sonar,jenkins
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
docker info (swarm node)
Containers: 5
 Running: 5
 Paused: 0
 Stopped: 0
Images: 8
Server Version: 17.07.0-ce
Storage Driver: overlay
 Backing Filesystem: extfs
 Supports d_type: true
Logging Driver: journald
Cgroup Driver: cgroupfs
Plugins:
 Volume: local (*use volume driver=local type=nfs*)
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: ph3qqr9ti750cc3m741rs1dsm
 Is Manager: false
 Node Address:  <node-ip>
 Manager Addresses:
  <manager-ip>:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 3addd840653146c90a254301d6c3a663c7fd6429
runc version: 2d41c047c83e09a6d61d464906feb2a2f3c52aa4
init version: 949e6fa
Security Options:
 seccomp 
  Profile: default
Kernel Version: 4.13.3-1.el7.elrepo.x86_64
Operating System: Red Hat Enterprise Linux Server 7.3 (Maipo)
OSType: linux
Architecture: x86_64
Name: node1
ID: KR3X:353P:3XHC:Z7M3:ZXU3:W4NI:7N4O:VYMI:MTYR:D47M:UMT4:LKH5
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Http Proxy: http://<proxy>
Https Proxy: http://<proxy>
No Proxy: <nodes-ips>,<master-1-ip>,<master-2-ip>,<proxy-ip>,127.0.0.1,localhost,sonar,jenkins
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
---

Steps to reproduce:

  1. on master-1: docker swarm init
  2. on master-1: docker swarm join-token worker
    on node-1: docker swarm join --listen-addr THIS_NODE_IP:2377 --advertise-addr THIS_NODE_IP --token "WORKER_TOKEN" MANAGER-1_NODE_IP:2377
  3. on master-1: docker swarm join-token manager
    on master-2: docker swarm join --listen-addr THIS_NODE_IP:2377 --advertise-addr THIS_NODE_IP --token "MANAGER_TOKEN" MANAGER-1_NODE_IP:2377
  4. alternative to step 3 (join as a worker, then promote):
    on master-1: docker swarm join-token worker
    on master-2: docker swarm join --listen-addr THIS_NODE_IP:2377 --advertise-addr THIS_NODE_IP --token "WORKER_TOKEN" MANAGER-1_NODE_IP:2377
    on master-1: docker node promote master-2

Expected Result:
The node should join the swarm as a manager, and docker node ls on the leader should show it as Reachable.

Actual Result:
Whether joining with the manager token or using the "worker" + "promote" method, the node fails with the message: cluster exited with error: manager stopped: can't initialize raft node: rpc error: code = Unknown desc = could not connect to prospective new cluster member using its advertised address: rpc error: code = Unavailable desc = grpc: the connection is unavailable

docker node ls on the leader shows the manager candidate node with status DOWN.
Logs from master-2:

Sep 30 19:10:46 master-2 dockerd: time="2017-09-30T19:10:46.400925292+02:00" level=info msg="Stopping manager" module=node node.id=7r92yw3pcfcjy4f299dwfwy4l
Sep 30 19:10:46 master-2 dockerd: time="2017-09-30T19:10:46.401013876+02:00" level=info msg="Manager shut down" module=node node.id=7r92yw3pcfcjy4f299dwfwy4l
Sep 30 19:10:46 master-2 dockerd: time="2017-09-30T19:10:46.401086367+02:00" level=info msg="shutting down certificate renewal routine" module="node/tls" node.id=7r92yw3pcfcjy4f299dwfwy4l node.role=swarm-manager
Sep 30 19:10:46 master-2 dockerd: time="2017-09-30T19:10:46.401551984+02:00" level=error msg="cluster exited with error: manager stopped: can't initialize raft node: rpc error: code = Unknown desc = could not connect to prospective new cluster member using its advertised address: rpc error: code = Unavailable desc = grpc: the connection is unavailable"
Sep 30 19:10:46 master-2 dockerd: time="2017-09-30T19:10:46.401601032+02:00" level=warning msg="Restarting swarm in 0.20 seconds"

Network connectivity has been checked (otherwise the rest of the swarm wouldn't work).
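
Since the error mentions the new member's advertised address, the direction that matters is from the existing manager to the joining node on TCP port 2377, which a working worker setup does not necessarily exercise. A minimal sketch of such a check, assuming bash (for its /dev/tcp redirection) and the coreutils timeout command are available; check_port is a hypothetical helper name and the 3-second limit is arbitrary:

```shell
#!/usr/bin/env bash
# check_port HOST [PORT]
# Succeeds (exit 0) if a TCP connection to HOST:PORT can be opened within 3 seconds.
# PORT defaults to 2377, the swarm cluster-management port used in the join commands.
# Hypothetical helper for illustration; relies on bash's /dev/tcp pseudo-device.
check_port() {
  local host="$1"
  local port="${2:-2377}"
  # The inner bash opens the socket on fd 3 and exits; timeout bounds the attempt.
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}
```

Run on master-1 against the address master-2 advertises, for example: check_port MASTER_2_IP 2377 && echo reachable || echo unreachable. MASTER_2_IP is a placeholder. This only tests raw TCP reachability; it says nothing about whether the daemon's own connection takes the same path.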

Thank you in advance.
