Skip to content

Fix 'docker cp' mount table explosion, take four#44210

Merged
cpuguy83 merged 11 commits intomoby:masterfrom
corhere:chrootarchive-without-reexec
Nov 11, 2022
Merged

Fix 'docker cp' mount table explosion, take four#44210
cpuguy83 merged 11 commits intomoby:masterfrom
corhere:chrootarchive-without-reexec

Conversation

@corhere
Copy link
Contributor

@corhere corhere commented Sep 27, 2022

- What I did

- How I did it

  • I got the issue to reproduce in a DinD container by changing the mount propagation mode to shared, which matches the behaviour of SystemD PID 1 outside of a container.
  • I arranged to have the daemon bind-mount the container volumes and perform the archiving operation inside an unshared mount namespace so the mount events would not propagate into the container's namespace.

- How to verify it

  • Shiny new integration test

- Description for the changelog

  • Fixed docker cp failing with "no space left on device" when the daemon root is mounted into the container as a volume

- A picture of a cute animal (not mandatory but encouraged)

@corhere corhere force-pushed the chrootarchive-without-reexec branch 2 times, most recently from c9f966f to c663e5c Compare September 28, 2022 21:33
@corhere corhere force-pushed the chrootarchive-without-reexec branch 4 times, most recently from 36182c9 to 311c00d Compare October 14, 2022 19:19
@corhere corhere changed the title [WIP] Chrootarchive without reexec Fix 'docker cp' mount table explosion, take four Oct 14, 2022
@fuweid
Copy link
Contributor

fuweid commented Oct 18, 2022

/cc

@corhere corhere marked this pull request as ready for review October 18, 2022 19:52
@corhere corhere requested a review from tianon as a code owner October 18, 2022 19:52
@corhere corhere force-pushed the chrootarchive-without-reexec branch from 8a422aa to e4cc8c0 Compare October 21, 2022 17:51
@neersighted neersighted self-requested a review October 24, 2022 17:26
Copy link
Member

@neersighted neersighted left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this is a lot easier to follow than the existing code, even with the use of cancelable asynchronous execution in the containerFSView thread. I do wonder if it would make sense to pool/reference count these threads, so that concurrent copy operations do not churn threads; it seems like the only barrier to doing so (which should likely be a follow-up PR) is complexity of implementation.

I tested this PR with both Docker-in-Docker (yay, the mount --make-rshared find makes this fully reproducible in dind!) and on a 'bare' Linux system (VM, no Docker-in-Docker), and all works as expected with both the minimal shell-based reproduction from my PR, as well as the stress test based on @kolyshkin's work in #38993.

Some final thoughts:

  • Can we drop container.StatPath entirely from container/archive.go? It seems like the only usages were here.
  • There is a moderate cleanup of the path handling code in archive_unix.go, but there is some duplication and ambiguity now, comments would help a lot.

@neersighted
Copy link
Member

neersighted commented Oct 25, 2022

For posterity, I'm using the following to reproduce the issue:

docker run --name mount-repro -d -v /var/run:/var/run busybox sleep 1d
docker exec -it mount-repro cat /proc/self/mountinfo | wc -l
docker cp mount-repro:/etc/passwd ./testfile
docker exec -it mount-repro cat /proc/self/mountinfo | wc -l
docker rm -f mount-repro
rm testfile

This stress-test hammers the daemon with copies and container restarts and blows up the mount table if the bug is not fixed (and is an improvement of a version of the same created by @kolyshkin in #38993):

#!/usr/bin/env bash

set -e

NCONTAINERS=${NCONTAINERS:-$((2*$(nproc)))}
NPARALLEL=${NPARALLEL:-4}
DURATION=${DURATION:-1m}

echo "Creating $NCONTAINERS containers for stress test..."
for _ in $(seq "$NCONTAINERS"); do
  docker run -d -v /var/run:/var/run busybox sleep 1d | tee -a stress-containers
done

OUTDIR=$(mktemp -d)

cleanup() {
  echo "Reaping tasks..."
  for job in $(jobs -p); do
    kill "$job"
  done

  wait

  echo "Killing containers..."
  parallel -u -j"$NPARALLEL" -a stress-containers docker rm -f '{}'
  rm -f stress-containers
  rm -rf "$OUTDIR"
}; trap 'cleanup' EXIT

copy() {
  echo "Starting copy..."
  while test -f stress-containers; do
    parallel -u -j"$NPARALLEL" -a <(shuf stress-containers) docker cp '{}':/etc/passwd "$(mktemp "${OUTDIR}/stress-XXXX")"
  done
}; copy &

restart() {
  echo "Starting restart..."
  while test -f stress-containers; do
    parallel -u -j"$NPARALLEL" -a <(shuf stress-containers) docker restart -t0 '{}'
  done
}; restart &

sleep "$DURATION"
echo "Test complete!"


# Change mount propagation to shared to make the environment more similar to a
# modern Linux system, e.g. with SystemD as PID 1.
mount --make-rshared /
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This script is used by others (https://hub.docker.com/_/docker, for example) as the canonical "what's necessary for Docker-in-Docker to function correctly" -- I'm guessing this change probably isn't appropriate in most cases outside our tests, right? 😅

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that to the contrary, it is necessary, as it makes dind's environment more like that of a 'real' Linux system (read: modern init system like systemd).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I mean is that this change technically changes the way the container shares mounts with the host, right? (It's not restricted just to containers of the Docker-in-Docker instance.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it is restricted just to containers of the DinD instance. Any mount namespaces cloned from the container's namespace will receive un/mount events from the container's namespace. Mount propagation to/from the host namespace is left unchanged.

@corhere
Copy link
Contributor Author

corhere commented Oct 26, 2022

I do wonder if it would make sense to pool/reference count these threads, so that concurrent copy operations do not churn threads; it seems like the only barrier to doing so (which should likely be a follow-up PR) is complexity of implementation.

I've thought about doing so, too. While it would be fun to write, I think it's a premature optimization until benchmarked otherwise: KISS, YAGNI. This implementation is already orders of magnitude faster than fork/exec'ing a whole new Go process (msecs to µsecs!) so any further optimizations will yield diminishing returns.

Some final thoughts:

  • Can we drop container.StatPath entirely from container/archive.go? It seems like the only usages were here.

How can you drop what has already been deleted? :philosoraptor:

Whatever you meant to ask, the answer is probably "no, because Windows support."

  • There is a moderate cleanup of the path handling code in archive_unix.go, but there is some duplication and ambiguity now, comments would help a lot.

Which one are you referring to: daemon/archive_unix.go or pkg/chrootarchive/archive_unix.go?

@neersighted
Copy link
Member

How can you drop what has already been deleted? :philosoraptor:

Whatever you meant to ask, the answer is probably "no, because Windows support."

Ah, that's my bad -- I was testing across OSes and saw the symbol was still there in Goland; I forgot you renamed it to have a _$GOOS tag.

Which one are you referring to: daemon/archive_unix.go or pkg/chrootarchive/archive_unix.go?

daemon/archive_unix.go -- I have individual comments in that file which I'll follow up on; the clarity of path-handling code there is my main reservation.

corhere and others added 8 commits October 26, 2022 12:04
Modify the DinD entrypoint scripts to make the issue reproducible inside
a DinD container.

Co-authored-by: Bjorn Neergaard <[email protected]>
Signed-off-by: Bjorn Neergaard <[email protected]>
Signed-off-by: Cory Snider <[email protected]>
The applyLayer implementation in pkg/chrootarchive has to set the TMPDIR
environment variable so that archive.UnpackLayer() can successfully
create the whiteout-file temp directory. Change UnpackLayer to create
the temporary directory under the destination path so that environment
variables do not need to be touched.

Signed-off-by: Cory Snider <[email protected]>
Unshare the thread's file system attributes and, if applicable, mount
namespace so that the chroot operation does not affect the rest of the
process.

Signed-off-by: Cory Snider <[email protected]>
This change was introduced early in the development of rootless support,
before all the kinks were worked out and rootlesskit was built. The
author was testing the daemon by inside a user namespace set up by runc,
observed that the unshare(2) syscall was returning EPERM, and assumed
that it was a fundamental limitation of user namespaces. Seeing as the
kernel documentation (of today) disagrees with that assessment and that
unshare demonstrably works inside user namespaces, I can only assume
that the EPERM was due to a quirk of their test environment, such as a
seccomp filter set up by runc blocking the unshare syscall.
moby#20902 (comment)

Mount namespaces are necessary to address moby#38995 and moby#43390. Revert the
special-casing so those issues can also be fixed for rootless daemons.

This reverts commit dc95056.

Signed-off-by: Cory Snider <[email protected]>
Refactor pkg/chrootarchive in terms of those utilities.

Signed-off-by: Cory Snider <[email protected]>
The Linux implementation needs to diverge significantly from the Windows
one in order to fix platform-specific bugs. Cut the generic
implementation out of daemon/archive.go and paste identical, verbatim
copies of that implementation into daemon/archive_{windows,linux}.go to
make it easier to compare the progression of changes to the respective
implementations through Git history.

Signed-off-by: Cory Snider <[email protected]>
It is only applicable to Windows so it does not need to be called from
platform-generic code. Fix locking in the Windows implementation.

Signed-off-by: Cory Snider <[email protected]>
The existing archive implementation is not easy to reason about by
reading the source. Prepare to rewrite it by covering more edge cases in
tests. The new test cases were determined by black-box characterizing
the existing behaviour.

Signed-off-by: Cory Snider <[email protected]>
@corhere corhere force-pushed the chrootarchive-without-reexec branch 2 times, most recently from 6e7f1e9 to 23e5af5 Compare October 26, 2022 18:21
Copy link
Member

@neersighted neersighted left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me -- the major increase in readability and clear intention/maintainability of the code is a big win over the previous implementation, even if there is more surface-level complexity (locked goroutines, unshare(2)/kernel APIs, cross-goroutine coordination/async).

Mounting a container's volumes under its rootfs directory inside the
host mount namespace causes problems with cross-namespace mount
propagation when /var/lib/docker is bind-mounted into the container as a
volume. The mount event propagates into the container's mount namespace,
overmounting the volume, but the propagated unmount events do not fully
reverse the effect. Each archive operation causes the mount table in the
container's mount namespace to grow larger and larger, until the kernel
limiton the number of mounts in a namespace is hit. The only solution to
this issue which is not subject to race conditions or other blocker
caveats is to avoid mounting volumes into the container's rootfs
directory in the host mount namespace in the first place.

Mount the container volumes inside an unshared mount namespace to
prevent any mount events from propagating into any other mount
namespace. Greatly simplify the archiving implementations by also
chrooting into the container rootfs to sidestep the need to resolve
paths in the host.

Signed-off-by: Cory Snider <[email protected]>
The new daemon.containerFSView type covers all the use-cases on Linux
with a much more intuitive API, but is not portable to Windows.
Discourage people from using the old and busted functions in new Linux
code by excluding them entirely from Linux builds.

Signed-off-by: Cory Snider <[email protected]>
@corhere corhere force-pushed the chrootarchive-without-reexec branch from 23e5af5 to dcd6c1d Compare October 27, 2022 16:52
@cpuguy83 cpuguy83 merged commit 6eab4f5 into moby:master Nov 11, 2022
@cpuguy83
Copy link
Member

Great work.
Is this going to be back ported to 22.06?

@neersighted
Copy link
Member

It was mentioned at one point, but Cory and Sebastiaan were vehemently opposed due to the late state of the 22.06 branch and the risk/scope of this change.

@neersighted neersighted added area/storage Image Storage kind/refactor PR's that refactor, or clean-up code labels Nov 11, 2022
"runtime"
"strings"

"github.com/hashicorp/go-multierror"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes made go-multierror a direct dependency (no longer indirect only); looks like CI didn't pick that up (wondering if we could have a way to "fast check" such changes 🤔); opened #44453 to fix that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/storage Image Storage kind/refactor PR's that refactor, or clean-up code

Projects

None yet

6 participants