Skip to content

dockerd fails to fully start with containerd snapshotter and many builds/images: context deadline exceeded #48569

@saiarcot895

Description

@saiarcot895

Description

When using docker with the containerd snapshotter enabled, if there are many images (maybe 40 or so) and many builds that have been done through buildkit, and the system gets rebooted, then there's a chance that the docker daemon may fail to fully start up the next time. Containerd starts up fine, but when the docker daemon starts, it and containerd use 100% CPU for about 1-2 minutes, and then dockerd exits with "context deadline exceeded". During this time, docker commands cannot be run at all.

All communication between docker and containerd is local, through a unix socket.

Reproduce

Unsure on the exact trigger exactly, but I would guess there would need to be a lot of images, a lot of layers, and/or a lot of builds that need to have been done.

Expected behavior

Docker daemon should start up fine.

docker version

Client: Docker Engine - Community
 Version:           27.3.1
 API version:       1.47
 Go version:        go1.22.7
 Git commit:        ce12230
 Built:             Fri Sep 20 11:41:00 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          27.3.1
  API version:      1.47 (minimum version 1.24)
  Go version:       go1.22.7
  Git commit:       41ca978
  Built:            Fri Sep 20 11:41:00 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.22
  GitCommit:        7f7fdf5fed64eb6a7caf99b3e12efcf9d60e311c
 io.containerd.runc.v2:
  Version:          1.1.14
  GitCommit:        v1.1.14-0-g2c9f560
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

docker info

Server:
 Containers: 11
  Running: 11
  Paused: 0
  Stopped: 0
 Images: 18
 Server Version: 27.2.1
 Storage Driver: overlayfs
  driver-type: io.containerd.snapshotter.v1
 Logging Driver: journald
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: io.containerd.runc.v2
 Init Binary: docker-init
 containerd version: 7f7fdf5fed64eb6a7caf99b3e12efcf9d60e311c
 runc version: v1.1.14-0-g2c9f560
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.8.0-1014-azure
 Operating System: Ubuntu 22.04.4 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 62.79GiB
 Name: sarcot-dev-vm
 ID: 355b7bed-4074-4b97-bf6e-6d38008107b7
 Docker Root Dir: /media/docker-data
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Additional Info

The docker info output is old data from about 2 weeks ago. Version 27.2.1 was running at the time and ran into this issue, but even after upgrading to 27.3.1, the issue is still seen.

Syslog from 27.3.1:

Oct 02 00:25:36 sarcot-dev-vm systemd[1]: Starting Docker Application Container Engine...
Oct 02 00:25:36 sarcot-dev-vm dockerd[52422]: time="2024-10-02T00:25:36.428163812-07:00" level=info msg="Starting up"
Oct 02 00:25:36 sarcot-dev-vm dockerd[52422]: time="2024-10-02T00:25:36.429102224-07:00" level=info msg="detected 127.0.0.53 nameserver, assuming systemd-resolved, so using resolv.conf: /run/systemd/resolve/resolv.conf"
Oct 02 00:25:36 sarcot-dev-vm dockerd[52422]: time="2024-10-02T00:25:36.465372790-07:00" level=info msg="Starting daemon with containerd snapshotter integration enabled"
Oct 02 00:25:36 sarcot-dev-vm dockerd[52422]: time="2024-10-02T00:25:36.465423014-07:00" level=info msg="Loading containers: start."
Oct 02 00:25:37 sarcot-dev-vm dockerd[52422]: time="2024-10-02T00:25:37.116870359-07:00" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Oct 02 00:25:37 sarcot-dev-vm dockerd[52422]: time="2024-10-02T00:25:37.258799302-07:00" level=info msg="Loading containers: done."
Oct 02 00:25:37 sarcot-dev-vm dockerd[52422]: time="2024-10-02T00:25:37.266067440-07:00" level=info msg="Docker daemon" commit=41ca978 containerd-snapshotter=true storage-driver=overlayfs version=27.3.1
Oct 02 00:25:37 sarcot-dev-vm dockerd[52422]: time="2024-10-02T00:25:37.266129776-07:00" level=info msg="Daemon has completed initialization"
Oct 02 00:26:26 sarcot-dev-vm dockerd[52422]: time="2024-10-02T00:26:26.341087280-07:00" level=info msg="Processing signal 'terminated'"
Oct 02 00:27:28 sarcot-dev-vm dockerd[52422]: context deadline exceeded
Oct 02 00:27:28 sarcot-dev-vm systemd[1]: docker.service: Main process exited, code=exited, status=1/FAILURE
Oct 02 00:27:28 sarcot-dev-vm systemd[1]: docker.service: Failed with result 'exit-code'.
Oct 02 00:27:28 sarcot-dev-vm systemd[1]: Stopped Docker Application Container Engine.
Oct 02 00:27:28 sarcot-dev-vm systemd[1]: docker.service: Consumed 1min 32.272s CPU time.

Go stack snapshot during the 1-2 minute period of high CPU usage is attached here, from version 27.3.1:
goroutine-stacks-2024-10-01T235510-0700.log

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions