Output of `docker version`:

```
Client:
 Version:      1.12.0-dev
 API version:  1.25
 Go version:   go1.6.2
 Git commit:   cccfe63
 Built:        Mon Jun 27 17:46:02 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.0-dev
 API version:  1.25
 Go version:   go1.6.2
 Git commit:   cccfe63
 Built:        Mon Jun 27 17:46:02 2016
 OS/Arch:      linux/amd64
```
Output of `docker info`:

```
Containers: 241
 Running: 5
 Paused: 0
 Stopped: 236
Images: 1
Server Version: 1.12.0-dev
Storage Driver: overlay
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: host bridge overlay null
Swarm: active
 NodeID: bnpa9guxfqznkkp6g9qmtbiim
 IsManager: No
Runtimes: default
Default Runtime: default
Security Options: apparmor seccomp
Kernel Version: 4.4.0-22-generic
Operating System: Ubuntu 16.04 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 992.6 MiB
Name: node05
ID: VPNA:RUIV:3HML:CTPL:2ZHA:7FXL:IPY7:JZTK:FJYX:KM77:MHZ2:W22H
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
 127.0.0.0/8
```
Additional environment details (AWS, VirtualBox, physical, etc.):
Digital Ocean VMs:
3 Managers: 2 GB Memory / 40 GB Disk / Ubuntu 16.04 x64
3 Agents: 1 GB Memory / 30 GB Disk / Ubuntu 16.04 x64
Steps to reproduce the issue:
- Create the cluster (`docker swarm init <...>` / `docker swarm join <...>`)
- Create one service (redis, for example)
- Scale to a ridiculous number of tasks with `docker service scale redis=3000` (a consolidated command sequence is sketched below)
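For reference, the reproduction boils down to something like the following. The addresses are placeholders, and the exact `swarm init`/`join` syntax may differ on this 1.12.0-dev build:

```
# On the first manager (10.0.0.1 is a placeholder address)
docker swarm init --listen-addr 10.0.0.1:2377

# On each of the other five nodes
docker swarm join 10.0.0.1:2377

# Create a service, then scale it far beyond what the cluster can hold
docker service create --name redis redis
docker service scale redis=3000
```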
Describe the results you received:
Managers panic one after the other until we lose the quorum.
What happens in order:
- The leader schedules the tasks and we see the counter going up using `docker service ls` (see the watch commands after this list)
- The leader reaches the point where it is running too many containers and the daemon runs out of memory or file descriptors (fds). Ultimately it crashes.
- Raft elects a new Leader amongst the Managers which picks up the scheduling logic.
- The same scenario repeats: the new leader reaches the point where there are too many open files, or it runs out of memory because of the running tasks.
- etc.
- We lose the quorum and the cluster becomes unusable.
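For anyone reproducing this, the progression is easy to watch from a still-healthy manager (standard commands in this build; output omitted):

```
# Replica count climbs until the current leader exhausts memory/fds
watch -n 2 docker service ls

# Managers drop out one by one as each elected leader crashes
watch -n 2 docker node ls
```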
A single command triggered a chain reaction that put the cluster out of use.
Describe the results you expected:
I expect the daemon to reserve enough resources for itself and to stop scheduling tasks on the Leader or other Managers if doing so would endanger cluster stability.
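In the meantime, the closest workaround is to drain the Managers manually so no tasks are scheduled on them. A minimal sketch, assuming the `--availability` flag behaves here as in the 1.12 CLI (node names are hypothetical):

```
# Run from a manager: take each manager node out of the scheduling pool
docker node update --availability drain manager01
docker node update --availability drain manager02
docker node update --availability drain manager03
```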
Further thoughts:
I'm not sure there is a good solution for this, but at the very least we should keep the Managers safe.
Some proposals and actionable items:
- We clearly document the behavior and warn users to reserve exclusive resources for the daemon so that it does not crash.
- We document that if you want the set of Managers to stay "safe", you should opt them out of scheduling by draining them, so they no longer run tasks as Agents (see the drain commands sketched above).
- We make scheduling decisions at a much finer grain and stop scaling, with a warning, when a node approaches its memory limit or maximum number of open fds. This would give the user the opportunity to correct the mistake and revert to a reasonable number of tasks. In this case we would track the resources left at the daemon level and disable the Agent when a given threshold is approached (a rough version of such a check is sketched below).
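To illustrate the kind of signal the last proposal would need, here is a rough, hypothetical shell check of the daemon's fd and memory headroom. The real implementation would live inside the daemon/agent; this only shows where the numbers could come from:

```
# Daemon pid; older builds may not ship a split dockerd binary
pid=$(pidof dockerd || pidof docker)

# Open file descriptors vs. the hard limit
echo "open fds: $(ls /proc/$pid/fd | wc -l)"
grep 'Max open files' /proc/$pid/limits

# Resident memory of the daemon itself
awk '/VmRSS/ {print "daemon RSS:", $2, $3}' /proc/$pid/status
```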
/cc @aluzzardi @aaronlehmann @stevvooe @icecrime @tiborvass