-
Notifications
You must be signed in to change notification settings - Fork 18.9k
Description
Description
When using docker with the containerd snapshotter enabled, if there are many images (maybe 40 or so) and many builds that have been done through buildkit, and the system gets rebooted, then there's a chance that the docker daemon may fail to fully start up the next time. Containerd starts up fine, but when the docker daemon starts, it and containerd use 100% CPU for about 1-2 minutes, and then dockerd exits with "context deadline exceeded". During this time, docker commands cannot be run at all.
All communication between docker and containerd is local, through a unix socket.
Reproduce
Unsure on the exact trigger exactly, but I would guess there would need to be a lot of images, a lot of layers, and/or a lot of builds that need to have been done.
Expected behavior
Docker daemon should start up fine.
docker version
Client: Docker Engine - Community
Version: 27.3.1
API version: 1.47
Go version: go1.22.7
Git commit: ce12230
Built: Fri Sep 20 11:41:00 2024
OS/Arch: linux/amd64
Context: default
Server: Docker Engine - Community
Engine:
Version: 27.3.1
API version: 1.47 (minimum version 1.24)
Go version: go1.22.7
Git commit: 41ca978
Built: Fri Sep 20 11:41:00 2024
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.7.22
GitCommit: 7f7fdf5fed64eb6a7caf99b3e12efcf9d60e311c
io.containerd.runc.v2:
Version: 1.1.14
GitCommit: v1.1.14-0-g2c9f560
docker-init:
Version: 0.19.0
GitCommit: de40ad0docker info
Server:
Containers: 11
Running: 11
Paused: 0
Stopped: 0
Images: 18
Server Version: 27.2.1
Storage Driver: overlayfs
driver-type: io.containerd.snapshotter.v1
Logging Driver: journald
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: io.containerd.runc.v2
Init Binary: docker-init
containerd version: 7f7fdf5fed64eb6a7caf99b3e12efcf9d60e311c
runc version: v1.1.14-0-g2c9f560
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: builtin
cgroupns
Kernel Version: 6.8.0-1014-azure
Operating System: Ubuntu 22.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 62.79GiB
Name: sarcot-dev-vm
ID: 355b7bed-4074-4b97-bf6e-6d38008107b7
Docker Root Dir: /media/docker-data
Debug Mode: false
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: falseAdditional Info
The docker info output is old data from about 2 weeks ago. Version 27.2.1 was running at the time and ran into this issue, but even after upgrading to 27.3.1, the issue is still seen.
Syslog from 27.3.1:
Oct 02 00:25:36 sarcot-dev-vm systemd[1]: Starting Docker Application Container Engine...
Oct 02 00:25:36 sarcot-dev-vm dockerd[52422]: time="2024-10-02T00:25:36.428163812-07:00" level=info msg="Starting up"
Oct 02 00:25:36 sarcot-dev-vm dockerd[52422]: time="2024-10-02T00:25:36.429102224-07:00" level=info msg="detected 127.0.0.53 nameserver, assuming systemd-resolved, so using resolv.conf: /run/systemd/resolve/resolv.conf"
Oct 02 00:25:36 sarcot-dev-vm dockerd[52422]: time="2024-10-02T00:25:36.465372790-07:00" level=info msg="Starting daemon with containerd snapshotter integration enabled"
Oct 02 00:25:36 sarcot-dev-vm dockerd[52422]: time="2024-10-02T00:25:36.465423014-07:00" level=info msg="Loading containers: start."
Oct 02 00:25:37 sarcot-dev-vm dockerd[52422]: time="2024-10-02T00:25:37.116870359-07:00" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Oct 02 00:25:37 sarcot-dev-vm dockerd[52422]: time="2024-10-02T00:25:37.258799302-07:00" level=info msg="Loading containers: done."
Oct 02 00:25:37 sarcot-dev-vm dockerd[52422]: time="2024-10-02T00:25:37.266067440-07:00" level=info msg="Docker daemon" commit=41ca978 containerd-snapshotter=true storage-driver=overlayfs version=27.3.1
Oct 02 00:25:37 sarcot-dev-vm dockerd[52422]: time="2024-10-02T00:25:37.266129776-07:00" level=info msg="Daemon has completed initialization"
Oct 02 00:26:26 sarcot-dev-vm dockerd[52422]: time="2024-10-02T00:26:26.341087280-07:00" level=info msg="Processing signal 'terminated'"
Oct 02 00:27:28 sarcot-dev-vm dockerd[52422]: context deadline exceeded
Oct 02 00:27:28 sarcot-dev-vm systemd[1]: docker.service: Main process exited, code=exited, status=1/FAILURE
Oct 02 00:27:28 sarcot-dev-vm systemd[1]: docker.service: Failed with result 'exit-code'.
Oct 02 00:27:28 sarcot-dev-vm systemd[1]: Stopped Docker Application Container Engine.
Oct 02 00:27:28 sarcot-dev-vm systemd[1]: docker.service: Consumed 1min 32.272s CPU time.
Go stack snapshot during the 1-2 minute period of high CPU usage is attached here, from version 27.3.1:
goroutine-stacks-2024-10-01T235510-0700.log