Fix 'docker cp' mount table explosion, take four#44210
Fix 'docker cp' mount table explosion, take four#44210cpuguy83 merged 11 commits intomoby:masterfrom
Conversation
c9f966f to
c663e5c
Compare
36182c9 to
311c00d
Compare
|
/cc |
8a422aa to
e4cc8c0
Compare
neersighted
left a comment
There was a problem hiding this comment.
Overall this is a lot easier to follow than the existing code, even with the use of cancelable asynchronous execution in the containerFSView thread. I do wonder if it would make sense to pool/reference count these threads, so that concurrent copy operations do not churn threads; it seems like the only barrier to doing so (which should likely be a follow-up PR) is complexity of implementation.
I tested this PR with both Docker-in-Docker (yay, the mount --make-rshared find makes this fully reproducible in dind!) and on a 'bare' Linux system (VM, no Docker-in-Docker), and all works as expected with both the minimal shell-based reproduction from my PR, as well as the stress test based on @kolyshkin's work in #38993.
Some final thoughts:
- Can we drop
container.StatPathentirely fromcontainer/archive.go? It seems like the only usages were here. - There is a moderate cleanup of the path handling code in
archive_unix.go, but there is some duplication and ambiguity now, comments would help a lot.
|
For posterity, I'm using the following to reproduce the issue: docker run --name mount-repro -d -v /var/run:/var/run busybox sleep 1d
docker exec -it mount-repro cat /proc/self/mountinfo | wc -l
docker cp mount-repro:/etc/passwd ./testfile
docker exec -it mount-repro cat /proc/self/mountinfo | wc -l
docker rm -f mount-repro
rm testfileThis stress-test hammers the daemon with copies and container restarts and blows up the mount table if the bug is not fixed (and is an improvement of a version of the same created by @kolyshkin in #38993): #!/usr/bin/env bash
set -e
NCONTAINERS=${NCONTAINERS:-$((2*$(nproc)))}
NPARALLEL=${NPARALLEL:-4}
DURATION=${DURATION:-1m}
echo "Creating $NCONTAINERS containers for stress test..."
for _ in $(seq "$NCONTAINERS"); do
docker run -d -v /var/run:/var/run busybox sleep 1d | tee -a stress-containers
done
OUTDIR=$(mktemp -d)
cleanup() {
echo "Reaping tasks..."
for job in $(jobs -p); do
kill "$job"
done
wait
echo "Killing containers..."
parallel -u -j"$NPARALLEL" -a stress-containers docker rm -f '{}'
rm -f stress-containers
rm -rf "$OUTDIR"
}; trap 'cleanup' EXIT
copy() {
echo "Starting copy..."
while test -f stress-containers; do
parallel -u -j"$NPARALLEL" -a <(shuf stress-containers) docker cp '{}':/etc/passwd "$(mktemp "${OUTDIR}/stress-XXXX")"
done
}; copy &
restart() {
echo "Starting restart..."
while test -f stress-containers; do
parallel -u -j"$NPARALLEL" -a <(shuf stress-containers) docker restart -t0 '{}'
done
}; restart &
sleep "$DURATION"
echo "Test complete!" |
|
|
||
| # Change mount propagation to shared to make the environment more similar to a | ||
| # modern Linux system, e.g. with SystemD as PID 1. | ||
| mount --make-rshared / |
There was a problem hiding this comment.
This script is used by others (https://hub.docker.com/_/docker, for example) as the canonical "what's necessary for Docker-in-Docker to function correctly" -- I'm guessing this change probably isn't appropriate in most cases outside our tests, right? 😅
There was a problem hiding this comment.
I think that to the contrary, it is necessary, as it makes dind's environment more like that of a 'real' Linux system (read: modern init system like systemd).
There was a problem hiding this comment.
What I mean is that this change technically changes the way the container shares mounts with the host, right? (It's not restricted just to containers of the Docker-in-Docker instance.)
There was a problem hiding this comment.
No, it is restricted just to containers of the DinD instance. Any mount namespaces cloned from the container's namespace will receive un/mount events from the container's namespace. Mount propagation to/from the host namespace is left unchanged.
I've thought about doing so, too. While it would be fun to write, I think it's a premature optimization until benchmarked otherwise: KISS, YAGNI. This implementation is already orders of magnitude faster than fork/exec'ing a whole new Go process (msecs to µsecs!) so any further optimizations will yield diminishing returns.
How can you drop what has already been deleted? :philosoraptor: Whatever you meant to ask, the answer is probably "no, because Windows support."
Which one are you referring to: |
Ah, that's my bad -- I was testing across OSes and saw the symbol was still there in Goland; I forgot you renamed it to have a
|
Modify the DinD entrypoint scripts to make the issue reproducible inside a DinD container. Co-authored-by: Bjorn Neergaard <[email protected]> Signed-off-by: Bjorn Neergaard <[email protected]> Signed-off-by: Cory Snider <[email protected]>
The applyLayer implementation in pkg/chrootarchive has to set the TMPDIR environment variable so that archive.UnpackLayer() can successfully create the whiteout-file temp directory. Change UnpackLayer to create the temporary directory under the destination path so that environment variables do not need to be touched. Signed-off-by: Cory Snider <[email protected]>
Unshare the thread's file system attributes and, if applicable, mount namespace so that the chroot operation does not affect the rest of the process. Signed-off-by: Cory Snider <[email protected]>
This change was introduced early in the development of rootless support, before all the kinks were worked out and rootlesskit was built. The author was testing the daemon by inside a user namespace set up by runc, observed that the unshare(2) syscall was returning EPERM, and assumed that it was a fundamental limitation of user namespaces. Seeing as the kernel documentation (of today) disagrees with that assessment and that unshare demonstrably works inside user namespaces, I can only assume that the EPERM was due to a quirk of their test environment, such as a seccomp filter set up by runc blocking the unshare syscall. moby#20902 (comment) Mount namespaces are necessary to address moby#38995 and moby#43390. Revert the special-casing so those issues can also be fixed for rootless daemons. This reverts commit dc95056. Signed-off-by: Cory Snider <[email protected]>
Refactor pkg/chrootarchive in terms of those utilities. Signed-off-by: Cory Snider <[email protected]>
The Linux implementation needs to diverge significantly from the Windows
one in order to fix platform-specific bugs. Cut the generic
implementation out of daemon/archive.go and paste identical, verbatim
copies of that implementation into daemon/archive_{windows,linux}.go to
make it easier to compare the progression of changes to the respective
implementations through Git history.
Signed-off-by: Cory Snider <[email protected]>
It is only applicable to Windows so it does not need to be called from platform-generic code. Fix locking in the Windows implementation. Signed-off-by: Cory Snider <[email protected]>
Signed-off-by: Cory Snider <[email protected]>
The existing archive implementation is not easy to reason about by reading the source. Prepare to rewrite it by covering more edge cases in tests. The new test cases were determined by black-box characterizing the existing behaviour. Signed-off-by: Cory Snider <[email protected]>
6e7f1e9 to
23e5af5
Compare
neersighted
left a comment
There was a problem hiding this comment.
Overall looks good to me -- the major increase in readability and clear intention/maintainability of the code is a big win over the previous implementation, even if there is more surface-level complexity (locked goroutines, unshare(2)/kernel APIs, cross-goroutine coordination/async).
Mounting a container's volumes under its rootfs directory inside the host mount namespace causes problems with cross-namespace mount propagation when /var/lib/docker is bind-mounted into the container as a volume. The mount event propagates into the container's mount namespace, overmounting the volume, but the propagated unmount events do not fully reverse the effect. Each archive operation causes the mount table in the container's mount namespace to grow larger and larger, until the kernel limiton the number of mounts in a namespace is hit. The only solution to this issue which is not subject to race conditions or other blocker caveats is to avoid mounting volumes into the container's rootfs directory in the host mount namespace in the first place. Mount the container volumes inside an unshared mount namespace to prevent any mount events from propagating into any other mount namespace. Greatly simplify the archiving implementations by also chrooting into the container rootfs to sidestep the need to resolve paths in the host. Signed-off-by: Cory Snider <[email protected]>
The new daemon.containerFSView type covers all the use-cases on Linux with a much more intuitive API, but is not portable to Windows. Discourage people from using the old and busted functions in new Linux code by excluding them entirely from Linux builds. Signed-off-by: Cory Snider <[email protected]>
23e5af5 to
dcd6c1d
Compare
|
Great work. |
|
It was mentioned at one point, but Cory and Sebastiaan were vehemently opposed due to the late state of the 22.06 branch and the risk/scope of this change. |
| "runtime" | ||
| "strings" | ||
|
|
||
| "github.com/hashicorp/go-multierror" |
There was a problem hiding this comment.
This changes made go-multierror a direct dependency (no longer indirect only); looks like CI didn't pick that up (wondering if we could have a way to "fast check" such changes 🤔); opened #44453 to fix that.
- What I did
- How I did it
- How to verify it
- Description for the changelog
docker cpfailing with "no space left on device" when the daemon root is mounted into the container as a volume- A picture of a cute animal (not mandatory but encouraged)