
swarm mode: Manager(s) cascading failures after docker service scale and daemon running out of memory #24027


Description

@abronan

Output of docker version:

Client:
 Version:      1.12.0-dev
 API version:  1.25
 Go version:   go1.6.2
 Git commit:   cccfe63
 Built:        Mon Jun 27 17:46:02 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.0-dev
 API version:  1.25
 Go version:   go1.6.2
 Git commit:   cccfe63
 Built:        Mon Jun 27 17:46:02 2016
 OS/Arch:      linux/amd64

Output of docker info:

Containers: 241
 Running: 5
 Paused: 0
 Stopped: 236
Images: 1
Server Version: 1.12.0-dev
Storage Driver: overlay
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: host bridge overlay null
Swarm: active
 NodeID: bnpa9guxfqznkkp6g9qmtbiim
 IsManager: No
Runtimes: default
Default Runtime: default
Security Options: apparmor seccomp
Kernel Version: 4.4.0-22-generic
Operating System: Ubuntu 16.04 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 992.6 MiB
Name: node05
ID: VPNA:RUIV:3HML:CTPL:2ZHA:7FXL:IPY7:JZTK:FJYX:KM77:MHZ2:W22H
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
 127.0.0.0/8

Additional environment details (AWS, VirtualBox, physical, etc.):

Digital Ocean VMs:

3 Managers: 2 GB Memory / 40 GB Disk / Ubuntu 16.04 x64
3 Agents: 1 GB Memory / 30 GB Disk / Ubuntu 16.04 x64

Steps to reproduce the issue:

  1. Create the cluster (docker swarm init <...> / docker swarm join <...>)
  2. Create one service (redis for example)
  3. Scale to a ridiculous number of tasks with docker service scale redis=3000 (full command sequence below)
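
For reference, the full command sequence on a fresh cluster. The init/join arguments are left elided as in the steps above, and the exact service invocation (image and --name) is my assumption rather than taken from the original report:

    # form the cluster (arguments elided, as above)
    docker swarm init <...>        # on the first manager
    docker swarm join <...>        # on every other manager and agent

    # create a small service, then scale it far beyond what the nodes can hold
    docker service create --name redis redis
    docker service scale redis=3000

    # the task counter can be watched climbing with
    docker service ls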

Describe the results you received:

Managers panic one after the other until we lose the quorum.

What happens in order:

  • The leader schedules the tasks and we see the counter going up using docker service ls
  • The leader reaches the point where it has too many containers running and the daemon runs out of memory or file descriptors, and ultimately crashes.
  • Raft elects a new Leader amongst the remaining Managers, which picks up the scheduling work.
  • The same scenario repeats: the new Leader reaches the point where it has too many open files or runs out of memory because of the tasks it is running.
  • etc.
  • We lose the quorum and the cluster becomes unusable (a rough way to watch this happen is sketched below).
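
A rough way to observe the cascade from a node that is still up (standard commands, not output captured from the original incident):

    docker node ls          # MANAGER STATUS flips from Reachable/Leader to Unreachable as managers crash
    docker service ls       # the replica counter stalls once scheduling stops
    journalctl -u docker    # on a crashed manager (systemd hosts): out-of-memory and "too many open files" errors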

A single command triggered a chain reaction that can put the whole cluster out of use.

Describe the results you expected:

I expect the daemon to reserve enough resources for itself, and to stop scheduling more tasks on the Leader or the other Managers if doing so would put cluster stability in danger.
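
One knob that exists today and approximates this is a placement constraint that keeps tasks off the Managers entirely; a minimal sketch, assuming the node.role constraint is supported in this dev build (the service name and image match the repro sequence above, not the original report):

    # schedule redis tasks on worker/agent nodes only, never on managers
    docker service create --name redis --constraint 'node.role == worker' redis
    docker service scale redis=3000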

Further thoughts:

I'm not sure if there is any good solution for this, but at least we should keep the Managers safe.

Some proposals and actionable items:

  • We clearly document the behavior and warn users to reserve resources exclusively for the daemon so that it does not crash.
  • We document that if you want the set of Managers to stay "safe", you should opt them out of task scheduling by draining them, effectively turning off their Agent role (see the sketch after this list).
  • We make the scheduling decision finer grained and stop scaling, with a warning, when a node approaches its memory limit or maximum number of file descriptors. This gives the user a chance to correct the mistake and revert to a reasonable number of tasks. In this case we track the resources left at the daemon level and disable the Agent when a given threshold is reached (see the reservation sketch below).
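
For the second item, the existing drain mechanism already covers the manual side; a minimal sketch, run from any Manager, assuming hypothetical node names node01..node03 for the three Managers:

    # stop scheduling tasks onto the managers themselves
    docker node update --availability drain node01
    docker node update --availability drain node02
    docker node update --availability drain node03

    # verify: the managers should now report AVAILABILITY = Drain
    docker node ls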
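
For the third item, per-task memory reservations are the closest existing lever: when every task declares a reservation, the scheduler stops placing new tasks on a node once its memory is fully accounted for instead of overcommitting it. A sketch, assuming the --reserve-memory flag is available in this build; the 50mb figure is only an illustration:

    # each task reserves 50 MB, so a 1 GB agent tops out at roughly 20 tasks
    docker service create --name redis --reserve-memory 50mb redis
    docker service scale redis=3000   # excess tasks should stay pending rather than exhausting the nodes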

/cc @aluzzardi @aaronlehmann @stevvooe @icecrime @tiborvass
