Fix 'docker cp' mount table explosion, take four by corhere · Pull Request #44210 · moby/moby

corhere · 2022-09-27T22:51:24Z

Supersedes [WIP] Fix mount loop on "docker cp" #38993
Supersedes [WIP] daemon/archive: attempt to resolve leaking mounts during cp #43583
Supersedes Fix 'docker cp' mount loop #44171

- What I did

Fixed docker cp returns “no space left on device” if entire host filesystem is mounted #38995
Fixed Using the docker cp command increases the number of mountpoints in the container #43390
Made the issue reproducible in a DinD environment so it can be covered by a regression test

- How I did it

I got the issue to reproduce in a DinD container by changing the mount propagation mode to shared, which matches the behaviour of systemd as PID 1 outside of a container.
I arranged to have the daemon bind-mount the container volumes and perform the archiving operation inside an unshared mount namespace so the mount events would not propagate into the container's namespace.

- How to verify it

Shiny new integration test

- Description for the changelog

Fixed docker cp failing with "no space left on device" when the daemon root is mounted into the container as a volume

- A picture of a cute animal (not mandatory but encouraged)

fuweid · 2022-10-18T00:07:08Z

/cc

neersighted

Overall this is a lot easier to follow than the existing code, even with the use of cancelable asynchronous execution in the containerFSView thread. I do wonder if it would make sense to pool/reference count these threads, so that concurrent copy operations do not churn threads; it seems like the only barrier to doing so (which should likely be a follow-up PR) is complexity of implementation.

I tested this PR with both Docker-in-Docker (yay, the mount --make-rshared find makes this fully reproducible in dind!) and on a 'bare' Linux system (VM, no Docker-in-Docker), and all works as expected with both the minimal shell-based reproduction from my PR, as well as the stress test based on @kolyshkin's work in #38993.

Some final thoughts:

Can we drop container.StatPath entirely from container/archive.go? It seems like the only usages were here.
There is a moderate cleanup of the path handling code in archive_unix.go, but there is some duplication and ambiguity now, comments would help a lot.

neersighted · 2022-10-25T23:11:16Z

For posterity, I'm using the following to reproduce the issue:

docker run --name mount-repro -d -v /var/run:/var/run busybox sleep 1d
docker exec -it mount-repro cat /proc/self/mountinfo | wc -l
docker cp mount-repro:/etc/passwd ./testfile
docker exec -it mount-repro cat /proc/self/mountinfo | wc -l
docker rm -f mount-repro
rm testfile
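
The before/after comparison in the reproduction above can be wrapped in two small helpers. This is only a sketch and not part of the PR: `count_mounts` and `mounts_grew` are hypothetical names, and the only assumption is that every line of `/proc/self/mountinfo` describes one mount.

```shell
# Hypothetical helpers (not from the PR) for comparing mount-table sizes.

# Print the number of mount entries in mountinfo-formatted text read from stdin.
count_mounts() {
  awk 'END { print NR }'
}

# Succeed (exit 0) only if the second count is larger than the first.
mounts_grew() {
  [ "$2" -gt "$1" ]
}

# Possible usage with the reproduction container:
#   before=$(docker exec mount-repro cat /proc/self/mountinfo | count_mounts)
#   docker cp mount-repro:/etc/passwd ./testfile
#   after=$(docker exec mount-repro cat /proc/self/mountinfo | count_mounts)
#   if mounts_grew "$before" "$after"; then echo "mount table grew: $before -> $after"; fi
```

On a daemon with the fix, the two counts should be equal, so `mounts_grew` should fail.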

This stress test hammers the daemon with copies and container restarts, and blows up the mount table if the bug is not fixed (it is an improved version of the script @kolyshkin wrote in #38993):

#!/usr/bin/env bash

set -e

NCONTAINERS=${NCONTAINERS:-$((2*$(nproc)))}
NPARALLEL=${NPARALLEL:-4}
DURATION=${DURATION:-1m}

echo "Creating $NCONTAINERS containers for stress test..."
for _ in $(seq "$NCONTAINERS"); do
  docker run -d -v /var/run:/var/run busybox sleep 1d | tee -a stress-containers
done

OUTDIR=$(mktemp -d)

cleanup() {
  echo "Reaping tasks..."
  for job in $(jobs -p); do
    kill "$job"
  done

  wait

  echo "Killing containers..."
  parallel -u -j"$NPARALLEL" -a stress-containers docker rm -f '{}'
  rm -f stress-containers
  rm -rf "$OUTDIR"
}; trap 'cleanup' EXIT

copy() {
  echo "Starting copy..."
  while test -f stress-containers; do
    parallel -u -j"$NPARALLEL" -a <(shuf stress-containers) docker cp '{}':/etc/passwd "$(mktemp "${OUTDIR}/stress-XXXX")"
  done
}; copy &

restart() {
  echo "Starting restart..."
  while test -f stress-containers; do
    parallel -u -j"$NPARALLEL" -a <(shuf stress-containers) docker restart -t0 '{}'
  done
}; restart &

sleep "$DURATION"
echo "Test complete!"

tianon · 2022-10-24T18:22:48Z


+# Change mount propagation to shared to make the environment more similar to a
+# modern Linux system, e.g. with SystemD as PID 1.
+mount --make-rshared /


This script is used by others (https://hub.docker.com/_/docker, for example) as the canonical "what's necessary for Docker-in-Docker to function correctly" -- I'm guessing this change probably isn't appropriate in most cases outside our tests, right? 😅

neersighted · 2022-10-26

I think that, to the contrary, it is necessary, as it makes dind's environment more like that of a 'real' Linux system (read: modern init system like systemd).

tianon · 2022-10-26

What I mean is that this change technically changes the way the container shares mounts with the host, right? (It's not restricted just to containers of the Docker-in-Docker instance.)

corhere · 2022-10-26

No, it is restricted just to containers of the DinD instance. Any mount namespaces cloned from the container's namespace will receive un/mount events from the container's namespace. Mount propagation to/from the host namespace is left unchanged.
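
Whether `mount --make-rshared /` took effect inside the DinD container can be checked from `/proc/self/mountinfo`. The sketch below is not part of the PR and `mount_propagation` is a hypothetical helper name. It relies on the mountinfo layout, where optional fields such as `shared:N` or `master:N` sit between the mount options (field 6) and the `-` separator.

```shell
# Hypothetical helper (not from the PR): print the propagation tags of every
# mount at the given mount point, reading mountinfo-formatted text from stdin.
# A mount with no optional fields before the "-" separator is private.
mount_propagation() {
  awk -v mp="$1" '
    $5 == mp {
      tags = ""
      for (i = 7; i <= NF && $i != "-"; i++) {
        if (tags != "") tags = tags ","
        tags = tags $i
      }
      if (tags == "") tags = "private"
      print tags
    }
  '
}

# Possible usage inside the DinD container:
#   mount_propagation / < /proc/self/mountinfo
```

If a mount point has been overmounted, one line is printed per mount stacked at that path.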

corhere · 2022-10-26T02:38:39Z

> I do wonder if it would make sense to pool/reference count these threads, so that concurrent copy operations do not churn threads; it seems like the only barrier to doing so (which should likely be a follow-up PR) is complexity of implementation.

I've thought about doing so, too. While it would be fun to write, I think it's a premature optimization until benchmarked otherwise: KISS, YAGNI. This implementation is already orders of magnitude faster than fork/exec'ing a whole new Go process (msecs to µsecs!) so any further optimizations will yield diminishing returns.

> Some final thoughts:
>
> Can we drop container.StatPath entirely from container/archive.go? It seems like the only usages were here.

How can you drop what has already been deleted? :philosoraptor:

Whatever you meant to ask, the answer is probably "no, because Windows support."

> There is a moderate cleanup of the path handling code in archive_unix.go, but there is some duplication and ambiguity now, comments would help a lot.

Which one are you referring to: daemon/archive_unix.go or pkg/chrootarchive/archive_unix.go?

neersighted · 2022-10-26T14:40:06Z

> How can you drop what has already been deleted? :philosoraptor:
>
> Whatever you meant to ask, the answer is probably "no, because Windows support."

Ah, that's my bad -- I was testing across OSes and saw the symbol was still there in GoLand; I forgot you renamed it to have a _$GOOS tag.

> Which one are you referring to: daemon/archive_unix.go or pkg/chrootarchive/archive_unix.go?

daemon/archive_unix.go -- I have individual comments in that file which I'll follow up on; the clarity of path-handling code there is my main reservation.

Modify the DinD entrypoint scripts to make the issue reproducible inside a DinD container. Co-authored-by: Bjorn Neergaard <[email protected]> Signed-off-by: Bjorn Neergaard <[email protected]> Signed-off-by: Cory Snider <[email protected]>

The applyLayer implementation in pkg/chrootarchive has to set the TMPDIR environment variable so that archive.UnpackLayer() can successfully create the whiteout-file temp directory. Change UnpackLayer to create the temporary directory under the destination path so that environment variables do not need to be touched. Signed-off-by: Cory Snider <[email protected]>

Unshare the thread's file system attributes and, if applicable, mount namespace so that the chroot operation does not affect the rest of the process. Signed-off-by: Cory Snider <[email protected]>

This change was introduced early in the development of rootless support, before all the kinks were worked out and rootlesskit was built. The author was testing the daemon inside a user namespace set up by runc, observed that the unshare(2) syscall was returning EPERM, and assumed that it was a fundamental limitation of user namespaces. Seeing as the kernel documentation (of today) disagrees with that assessment and that unshare demonstrably works inside user namespaces, I can only assume that the EPERM was due to a quirk of their test environment, such as a seccomp filter set up by runc blocking the unshare syscall. moby#20902 (comment) Mount namespaces are necessary to address moby#38995 and moby#43390. Revert the special-casing so those issues can also be fixed for rootless daemons. This reverts commit dc95056. Signed-off-by: Cory Snider <[email protected]>

Add reusable chroot and unshare utilities

Refactor pkg/chrootarchive in terms of those utilities. Signed-off-by: Cory Snider <[email protected]>

The Linux implementation needs to diverge significantly from the Windows one in order to fix platform-specific bugs. Cut the generic implementation out of daemon/archive.go and paste identical, verbatim copies of that implementation into daemon/archive_{windows,linux}.go to make it easier to compare the progression of changes to the respective implementations through Git history. Signed-off-by: Cory Snider <[email protected]>

daemon: refactor isOnlineFSOperationPermitted

It is only applicable to Windows so it does not need to be called from platform-generic code. Fix locking in the Windows implementation. Signed-off-by: Cory Snider <[email protected]>

daemon: drop Windows-only code from archive_unix.go

Signed-off-by: Cory Snider <[email protected]>

neersighted

Overall looks good to me -- the major increase in readability and clear intention/maintainability of the code is a big win over the previous implementation, even if there is more surface-level complexity (locked goroutines, unshare(2)/kernel APIs, cross-goroutine coordination/async).

Mounting a container's volumes under its rootfs directory inside the host mount namespace causes problems with cross-namespace mount propagation when /var/lib/docker is bind-mounted into the container as a volume. The mount event propagates into the container's mount namespace, overmounting the volume, but the propagated unmount events do not fully reverse the effect. Each archive operation causes the mount table in the container's mount namespace to grow larger and larger, until the kernel limit on the number of mounts in a namespace is hit. The only solution to this issue which is not subject to race conditions or other blocker caveats is to avoid mounting volumes into the container's rootfs directory in the host mount namespace in the first place. Mount the container volumes inside an unshared mount namespace to prevent any mount events from propagating into any other mount namespace. Greatly simplify the archiving implementations by also chrooting into the container rootfs to sidestep the need to resolve paths in the host. Signed-off-by: Cory Snider <[email protected]>
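
The idea in this commit message can be illustrated from the shell with util-linux `unshare`. This is a rough analogy under stated assumptions, not the daemon's implementation: per the commit messages, the PR does the equivalent in-process on a dedicated thread and also chroots into the rootfs. The helper name `tar_from_rootfs` and its argument order are hypothetical, and actually running it requires root.

```shell
# Hypothetical sketch (not the daemon's code): archive a path from a container
# rootfs with one volume bind-mounted, inside a private mount namespace so the
# mount event cannot propagate to any other namespace.
#   $1 rootfs directory               $2 volume source on the host
#   $3 volume destination in rootfs   $4 path to archive, relative to rootfs root
tar_from_rootfs() {
  unshare --mount --propagation private -- sh -c '
    set -e
    # This bind mount exists only in the new mount namespace.
    mount --bind "$2" "$1$3"
    # Write the archive to stdout; the namespace and its mounts vanish on exit.
    tar -C "$1" -cf - ".$4"
  ' sh "$1" "$2" "$3" "$4"
}

# Possible usage (as root):
#   tar_from_rootfs /path/to/rootfs /path/to/volume /data /data/file > out.tar
```

Because the bind mount is created after the namespace is unshared with private propagation, no unmount step is needed and nothing is left behind in the host or container mount tables.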

The new daemon.containerFSView type covers all the use-cases on Linux with a much more intuitive API, but is not portable to Windows. Discourage people from using the old and busted functions in new Linux code by excluding them entirely from Linux builds. Signed-off-by: Cory Snider <[email protected]>

cpuguy83 · 2022-11-11T18:47:54Z

Great work.
Is this going to be backported to 22.06?

neersighted · 2022-11-11T19:42:03Z

It was mentioned at one point, but Cory and Sebastiaan were vehemently opposed due to the late state of the 22.06 branch and the risk/scope of this change.

thaJeztah · 2022-11-14T09:33:00Z

+	"runtime"
+	"strings"
+
+	"github.com/hashicorp/go-multierror"


This change made go-multierror a direct dependency (no longer indirect only); looks like CI didn't pick that up (wondering if we could have a way to "fast check" such changes 🤔); opened #44453 to fix that.

vijayaramaraju-kalidindi · 2026-03-24T03:13:54Z

@neersighted @corhere @thaJeztah - What is the fix for #52201?? Is it addressed in future releases??

corhere · 2026-03-24T03:27:12Z

@vijayaramaraju-kalidindi you are looking at the fix. It was released with v24.0.

corhere force-pushed the chrootarchive-without-reexec branch 2 times, most recently from c9f966f to c663e5c Compare September 28, 2022 21:33

corhere mentioned this pull request Sep 29, 2022

Stop subprocesses from getting unexpectedly killed #44215

Merged

thaJeztah reviewed Sep 29, 2022

Comment thread pkg/chrootarchive/go_linux.go Outdated

corhere force-pushed the chrootarchive-without-reexec branch 4 times, most recently from 36182c9 to 311c00d Compare October 14, 2022 19:19

corhere changed the title ~~[WIP] Chrootarchive without reexec~~ Fix 'docker cp' mount table explosion, take four Oct 14, 2022

thaJeztah mentioned this pull request Oct 16, 2022

pkg/chrootarchive: replace system.MkdirAll for os.Mkdir, use t.TempDir() #44306

Merged

corhere mentioned this pull request Oct 18, 2022

Replace overlay2 mount reexec with in-proc impl #44285

Merged

corhere marked this pull request as ready for review October 18, 2022 19:52

corhere requested a review from tianon as a code owner October 18, 2022 19:52

corhere force-pushed the chrootarchive-without-reexec branch from 8a422aa to e4cc8c0 Compare October 21, 2022 17:51

corhere mentioned this pull request Oct 21, 2022

[WIP] Separate static binary for chrootarchive #43186

Closed

neersighted self-requested a review October 24, 2022 17:26

neersighted requested changes Oct 25, 2022

tianon reviewed Oct 26, 2022

corhere and others added 8 commits October 26, 2022 12:04

pkg/chrootarchive: stop reexec'ing before chroot

5de2296

Add reusable chroot and unshare utilities

60ee6f7

daemon: refactor isOnlineFSOperationPermitted

4fd91c3

daemon: drop Windows-only code from archive_unix.go

6750d1b

corhere force-pushed the chrootarchive-without-reexec branch 2 times, most recently from 6e7f1e9 to 23e5af5 Compare October 26, 2022 18:21

neersighted approved these changes Oct 26, 2022

corhere added 2 commits October 27, 2022 12:52

corhere force-pushed the chrootarchive-without-reexec branch from 23e5af5 to dcd6c1d Compare October 27, 2022 16:52

tianon approved these changes Nov 7, 2022

cpuguy83 approved these changes Nov 11, 2022

cpuguy83 merged commit 6eab4f5 into moby:master Nov 11, 2022

neersighted added labels area/storage (Image Storage) and kind/refactor (PRs that refactor or clean up code) Nov 11, 2022

thaJeztah mentioned this pull request Nov 14, 2022

fix vendor.mod: add hashicorp/go-multierror as direct dependency #44453

Merged

thaJeztah reviewed Nov 14, 2022

corhere deleted the chrootarchive-without-reexec branch November 17, 2022 18:12

corhere mentioned this pull request Nov 29, 2022

pkg/chrootarchive DNS resolution causing unexpected long delay before container start #44540

Open

crazy-max mentioned this pull request Jan 2, 2023

[23.0 backport] Dockerfile: use TARGETPLATFORM to build Docker #44736

Merged

thaJeztah added this to the v-next milestone Jan 7, 2023

rumpl mentioned this pull request Feb 6, 2023

containerd integration: docker run #44804

Merged

crazy-max mentioned this pull request Jun 6, 2023

FreeBSD port moby/buildkit#2376

Merged

corhere mentioned this pull request Jun 29, 2023

Error creating overlay mount: no space left on device #45760

Closed

vvoland mentioned this pull request Sep 6, 2023

Copy command fails when container is using a specific user #46388

Open

thaJeztah mentioned this pull request Sep 6, 2023

pkg/idtools: remove sync.Once, and include lookup error #46417

Merged

nivbend mentioned this pull request Dec 25, 2023

Build fails on Mac #46987

Open

joerick mentioned this pull request Mar 8, 2024

tar: Error opening archive: Unrecognized archive format when copying wheels to host pypa/cibuildwheel#1782

Closed

jmeza-xyz mentioned this pull request Nov 26, 2025

RKE1 Cluster with Docker getting error "No space left on device" rancher/rancher#52800

Closed

vijayaramaraju-kalidindi mentioned this pull request Mar 24, 2026

Docker 20.10.5 overlay2: "no space left on device" during docker cp despite sufficient disk and inode availability #52201

Closed
