Skip to content

Backoff time not honored by error message causes system log flood #36683

@alexvy86

Description

@alexvy86

Description

The dockerd process of a worker node in a docker swarm cluster filled my logs with a barrage of the following error:
level=error msg="agent: session failed" backoff=8s error="no remote hosts provided" module=node/agent node.id=tzah7lertx0mdg9cudrmmjhcg, completely ignoring the supposed backoff time specified in the message, eventually filling the disk and causing docker to crash (among other things).

Steps to reproduce the issue:
Not sure, it just started happening out of the blue yesterday night. Right before the errors, the logs indicate that a process (an application running in one of the containers) was killed due to OOM, but that might be unrelated.

Describe the results you received:
dockerd kept logging the message non-stop (~500 times per second), even though it apparently should have been using something like exponential back-off (the first message says backoff=100ms, the second 300ms, then 700ms, 1.5s, 3.1s, 6,3s, and finally 8s).

Describe the results you expected:
This error message is logged according to the backoff policy it is stating, instead of flooding the logs.

Additional information you deem important (e.g. issue happens only occasionally):
My swarm is setup with 1 manager (this is a just test environment) and 3 worker nodes. The manager is set to availability Pause so it doesn't run tasks, and all three worker nodes are set to Active. I just upgraded all of them to docker 18.03.0-ce yesterday night

Here's a screenshot of the logs, when the errors started happening:
image

Output of docker version:
In the worker node:

Client:
 Version:       18.03.0-ce
 API version:   1.37
 Go version:    go1.9.4
 Git commit:    0520e24
 Built: Wed Mar 21 23:10:01 2018
 OS/Arch:       linux/amd64
 Experimental:  false
 Orchestrator:  swarm

Server:
 Engine:
  Version:      18.03.0-ce
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.4
  Git commit:   0520e24
  Built:        Wed Mar 21 23:08:31 2018
  OS/Arch:      linux/amd64
  Experimental: false

In the manager node:

Client:
 Version:       18.03.0-ce
 API version:   1.37
 Go version:    go1.9.4
 Git commit:    0520e24
 Built: Wed Mar 21 23:10:01 2018
 OS/Arch:       linux/amd64
 Experimental:  false
 Orchestrator:  swarm

Server:
 Engine:
  Version:      18.03.0-ce
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.4
  Git commit:   0520e24
  Built:        Wed Mar 21 23:08:31 2018
  OS/Arch:      linux/amd64
  Experimental: false

Output of docker info:
In the worker node:

Containers: 269
 Running: 0
 Paused: 0
 Stopped: 269
Images: 217
Server Version: 18.03.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: pending
 NodeID: tzah7lertx0mdg9cudrmmjhcg
 Is Manager: false
 Node Address: 10.0.0.7
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: cfd04396dc68220d1cecbe686a6cc3aa5ce3667c
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.13.0-1011-azure
Operating System: Ubuntu 16.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 1.863GiB
Name: swarmm-agent-13957614000003
ID: N5OP:KAVU:DGGS:Y3D7:CTTZ:VXF7:A7OV:6CLA:ZNR5:CAWU:5BMM:DHMI
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

In the manager node:

Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 216
Server Version: 18.03.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: 09kaqkkjwsx4yq6ve9jxn8n3h
 Is Manager: true
 ClusterID: xozu5p2qf7kkyad3q9224schk
 Managers: 0
 Nodes: 4
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 172.16.0.5
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: cfd04396dc68220d1cecbe686a6cc3aa5ce3667c
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-116-generic
Operating System: Ubuntu 16.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 1.873GiB
Name: swarmm-master-13957614-0
ID: IMAS:YBUM:ILQH:6GEI:XVCC:QOWP:H6YI:JZ74:JONP:LKI6:YIGW:HPB2
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Username: alexvy86
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

Additional environment details (AWS, VirtualBox, physical, etc.):
The manager and the nodes are Virtual Machine Scale Sets in Azure, originally deployed using these instructions.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions