Description
The dockerd process of a worker node in a docker swarm cluster filled my logs with a barrage of the following error:
level=error msg="agent: session failed" backoff=8s error="no remote hosts provided" module=node/agent node.id=tzah7lertx0mdg9cudrmmjhcg, completely ignoring the supposed backoff time specified in the message, eventually filling the disk and causing docker to crash (among other things).
Steps to reproduce the issue:
Not sure, it just started happening out of the blue yesterday night. Right before the errors, the logs indicate that a process (an application running in one of the containers) was killed due to OOM, but that might be unrelated.
Describe the results you received:
dockerd kept logging the message non-stop (~500 times per second), even though it apparently should have been using something like exponential back-off (the first message says backoff=100ms, the second 300ms, then 700ms, 1.5s, 3.1s, 6,3s, and finally 8s).
Describe the results you expected:
This error message is logged according to the backoff policy it is stating, instead of flooding the logs.
Additional information you deem important (e.g. issue happens only occasionally):
My swarm is setup with 1 manager (this is a just test environment) and 3 worker nodes. The manager is set to availability Pause so it doesn't run tasks, and all three worker nodes are set to Active. I just upgraded all of them to docker 18.03.0-ce yesterday night
Here's a screenshot of the logs, when the errors started happening:

Output of docker version:
In the worker node:
Client:
Version: 18.03.0-ce
API version: 1.37
Go version: go1.9.4
Git commit: 0520e24
Built: Wed Mar 21 23:10:01 2018
OS/Arch: linux/amd64
Experimental: false
Orchestrator: swarm
Server:
Engine:
Version: 18.03.0-ce
API version: 1.37 (minimum version 1.12)
Go version: go1.9.4
Git commit: 0520e24
Built: Wed Mar 21 23:08:31 2018
OS/Arch: linux/amd64
Experimental: false
In the manager node:
Client:
Version: 18.03.0-ce
API version: 1.37
Go version: go1.9.4
Git commit: 0520e24
Built: Wed Mar 21 23:10:01 2018
OS/Arch: linux/amd64
Experimental: false
Orchestrator: swarm
Server:
Engine:
Version: 18.03.0-ce
API version: 1.37 (minimum version 1.12)
Go version: go1.9.4
Git commit: 0520e24
Built: Wed Mar 21 23:08:31 2018
OS/Arch: linux/amd64
Experimental: false
Output of docker info:
In the worker node:
Containers: 269
Running: 0
Paused: 0
Stopped: 269
Images: 217
Server Version: 18.03.0-ce
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: pending
NodeID: tzah7lertx0mdg9cudrmmjhcg
Is Manager: false
Node Address: 10.0.0.7
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: cfd04396dc68220d1cecbe686a6cc3aa5ce3667c
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.13.0-1011-azure
Operating System: Ubuntu 16.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 1.863GiB
Name: swarmm-agent-13957614000003
ID: N5OP:KAVU:DGGS:Y3D7:CTTZ:VXF7:A7OV:6CLA:ZNR5:CAWU:5BMM:DHMI
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: No swap limit support
In the manager node:
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 216
Server Version: 18.03.0-ce
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
NodeID: 09kaqkkjwsx4yq6ve9jxn8n3h
Is Manager: true
ClusterID: xozu5p2qf7kkyad3q9224schk
Managers: 0
Nodes: 4
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 3
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Force Rotate: 0
Autolock Managers: false
Root Rotation In Progress: false
Node Address: 172.16.0.5
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: cfd04396dc68220d1cecbe686a6cc3aa5ce3667c
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.4.0-116-generic
Operating System: Ubuntu 16.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 1.873GiB
Name: swarmm-master-13957614-0
ID: IMAS:YBUM:ILQH:6GEI:XVCC:QOWP:H6YI:JZ74:JONP:LKI6:YIGW:HPB2
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Username: alexvy86
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: No swap limit support
Additional environment details (AWS, VirtualBox, physical, etc.):
The manager and the nodes are Virtual Machine Scale Sets in Azure, originally deployed using these instructions.
Description
The
dockerdprocess of a worker node in a docker swarm cluster filled my logs with a barrage of the following error:level=error msg="agent: session failed" backoff=8s error="no remote hosts provided" module=node/agent node.id=tzah7lertx0mdg9cudrmmjhcg, completely ignoring the supposed backoff time specified in the message, eventually filling the disk and causing docker to crash (among other things).Steps to reproduce the issue:
Not sure, it just started happening out of the blue yesterday night. Right before the errors, the logs indicate that a process (an application running in one of the containers) was killed due to OOM, but that might be unrelated.
Describe the results you received:
dockerdkept logging the message non-stop (~500 times per second), even though it apparently should have been using something like exponential back-off (the first message saysbackoff=100ms, the second300ms, then700ms,1.5s,3.1s,6,3s, and finally8s).Describe the results you expected:
This error message is logged according to the backoff policy it is stating, instead of flooding the logs.
Additional information you deem important (e.g. issue happens only occasionally):
My swarm is setup with 1 manager (this is a just test environment) and 3 worker nodes. The manager is set to availability
Pauseso it doesn't run tasks, and all three worker nodes are set toActive. I just upgraded all of them to docker 18.03.0-ce yesterday nightHere's a screenshot of the logs, when the errors started happening:

Output of
docker version:In the worker node:
In the manager node:
Output of
docker info:In the worker node:
In the manager node:
Additional environment details (AWS, VirtualBox, physical, etc.):
The manager and the nodes are Virtual Machine Scale Sets in Azure, originally deployed using these instructions.