
daemon: clean up dead containers on start#51692

Merged
vvoland merged 1 commit into moby:master from akerouanton:remove-dead-ctrs-on-startup
Dec 11, 2025

Conversation

@akerouanton
Member

@akerouanton akerouanton commented Dec 11, 2025

- What I did

Stopping the Engine while a container with autoremove set is running may leave behind dead containers on disk. These containers aren't reclaimed on the next start, appear as "dead" in `docker ps -a`, and can't be inspected or removed by the user.

This bug has existed for a long time but became user visible with 9f5f4f5. Prior to that commit, containers with no rwlayer weren't added to the in-memory viewdb, so they weren't visible in `docker ps -a`. However, some dangling files would still live on disk (e.g. folders in /var/lib/docker/containers, mount points, etc.).

The underlying issue is that when the daemon stops, it tries to stop all running containers and then closes the containerd client. This leaves a small window of time where the Engine might receive 'task stop' events from containerd and trigger autoremove. If the containerd client is closed in parallel, the Engine is unable to complete the removal, leaving the container in the 'dead' state. In such cases, the Engine logs the following error:

cannot remove container "bcbc98b4f5c2b072eb3c4ca673fa1c222d2a8af00bf58eae0f37085b9724ea46": Canceled: grpc: the client connection is closing: context canceled

Solving the underlying issue would require complex changes to the shutdown sequence. Moreover, the same issue could also happen if the daemon crashes while it deletes a container. Thus, add a cleanup step on daemon startup to remove these dead containers.
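A minimal sketch of what such a startup cleanup could look like (hypothetical `Container` type and `cleanupDeadContainers` helper for illustration only, not the daemon's actual code):

```go
package main

import "fmt"

// Container is a trimmed-down, hypothetical stand-in for the daemon's
// container state; only the fields relevant to the cleanup are shown.
type Container struct {
	ID         string
	Dead       bool
	AutoRemove bool
}

// cleanupDeadContainers drops containers left in the 'dead' state with
// autoremove set, returning the survivors. The real daemon would also
// have to delete on-disk state (the container's folder under
// /var/lib/docker/containers, mount points, etc.).
func cleanupDeadContainers(ctrs []*Container) []*Container {
	kept := ctrs[:0] // filter in place over the same backing array
	for _, c := range ctrs {
		if c.Dead && c.AutoRemove {
			fmt.Printf("removing dead container %s\n", c.ID)
			continue
		}
		kept = append(kept, c)
	}
	return kept
}

func main() {
	ctrs := []*Container{
		{ID: "dead-autoremove", Dead: true, AutoRemove: true},
		{ID: "still-running", Dead: false, AutoRemove: true},
	}
	ctrs = cleanupDeadContainers(ctrs)
	fmt.Println(len(ctrs)) // prints 1
}
```

Running this pass once during daemon startup is enough to reclaim containers stranded by the shutdown race, without touching the shutdown sequence itself.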

- How to verify it

A new integration test has been added.

- Human-readable description for the release notes

Fix a bug that could cause the Engine to leave containers with autoremove set in the 'dead' state on shutdown, and never reclaim them.

@akerouanton akerouanton added this to the 29.2.0 milestone Dec 11, 2025
@akerouanton akerouanton self-assigned this Dec 11, 2025
@akerouanton akerouanton force-pushed the remove-dead-ctrs-on-startup branch from 264a31f to fd96eaf on December 11, 2025 17:11
@akerouanton akerouanton marked this pull request as ready for review December 11, 2025 17:17
Contributor

@austinvazquez austinvazquez left a comment

LGTM.

@vvoland
Contributor

vvoland commented Dec 11, 2025

The CI fails on Windows:

=== FAIL: integration/container TestRemoveDeadContainersOnDaemonRestart (0.53s)
    daemon.go:333: [de1dbe7ab70dc] Status: unknown flag: --userland-proxy
    daemon.go:333: [de1dbe7ab70dc] See 'dockerd --help'., Code: 125
    remove_test.go:117: [de1dbe7ab70dc] failed to start daemon with arguments [--config-file /dev/null --data-root D:\a\moby\moby\go\src\github.com\docker\docker\bundles\tmp\TestRemoveDeadContainersOnDaemonRestart\de1dbe7ab70dc\root --exec-root C:\Users\RUNNER~1\AppData\Local\Temp\dxr\de1dbe7ab70dc --pidfile D:\a\moby\moby\go\src\github.com\docker\docker\bundles\tmp\TestRemoveDeadContainersOnDaemonRestart\de1dbe7ab70dc\docker.pid --userland-proxy=true --containerd-namespace de1dbe7ab70dc --containerd-plugins-namespace de1dbe7ab70dcp --containerd /var/run/docker/containerd/containerd.sock --host unix://C:\Users\RUNNER~1\AppData\Local\Temp\docker-integration\de1dbe7ab70dc.sock --debug] : [de1dbe7ab70dc] daemon exited during startup: exit status 1

Stopping the Engine while a container with autoremove set is running may
leave behind dead containers on disk. These containers aren't reclaimed
on next start, appear as "dead" in `docker ps -a` and can't be
inspected or removed by the user.

This bug has existed for a long time but became user visible with
9f5f4f5. Prior to that commit,
containers with no rwlayer weren't added to the in-memory viewdb, so
they weren't visible in `docker ps -a`. However, some dangling files
would still live on disk (e.g. folders in /var/lib/docker/containers,
mount points, etc).

The underlying issue is that when the daemon stops, it tries to stop all
running containers and then closes the containerd client. This leaves a
small window of time where the Engine might receive 'task stop' events
from containerd, and trigger autoremove. If the containerd client is
closed in parallel, the Engine is unable to complete the removal,
leaving the container in the 'dead' state. In such cases, the Engine logs the
following error:

    cannot remove container "bcbc98b4f5c2b072eb3c4ca673fa1c222d2a8af00bf58eae0f37085b9724ea46": Canceled: grpc: the client connection is closing: context canceled

Solving the underlying issue would require complex changes to the
shutdown sequence. Moreover, the same issue could also happen if the
daemon crashes while it deletes a container. Thus, add a cleanup step
on daemon startup to remove these dead containers.

Signed-off-by: Albin Kerouanton <[email protected]>
@austinvazquez austinvazquez force-pushed the remove-dead-ctrs-on-startup branch from fd96eaf to ec9315c on December 11, 2025 19:40
Contributor

@vvoland vvoland left a comment

LGTM (after CI green)

Member

@thaJeztah thaJeztah left a comment

LGTM on green, thanks!
