[WIP] Fix mount loop on "docker cp" #38993
Conversation
Codecov Report
```
@@            Coverage Diff            @@
##             master   #38993   +/- ##
=========================================
  Coverage          ?      37%
=========================================
  Files             ?      612
  Lines             ?    45393
  Branches          ?        0
=========================================
  Hits              ?    16796
  Misses            ?    26318
  Partials          ?     2279
```
Got a number of errors like
coming from runc. Looking into it.
Unfortunately this approach does not work well with some setups; I'll try an alternative way.
We just bumped containerd and runc; is it worth checking whether the new versions don't have this problem in combination with your patch?
@thaJeztah Most probably not, but I rebased it just in case.
Well... CI is green 😉 (I recalled there were some issues fixed around groups not being found under certain conditions, and an SELinux issue fixed, which also produced a similar error.) Not sure if CI is hitting the problem you were talking about, though 😅
@kolyshkin do you think it's possible to write a test for this, or would that be highly flaky (if other containers are running, counting mounts could be affected by those)?
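For what it's worth, counting mounts in a test boils down to reading mountinfo. A minimal sketch (hypothetical helper, not existing test code; as noted, a per-PID count would indeed see mounts created by other containers):

```shell
# Hypothetical helper: count mounts visible in a process's mount namespace.
# Pass a PID (e.g. a container's init PID); defaults to the current process.
count_mounts() {
    wc -l < "/proc/${1:-self}/mountinfo"
}

count_mounts   # mount count of the current mount namespace
```

A flake-resistant test would compare the count for the *same* container before and after a `docker cp`, rather than asserting an absolute number.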
Discussed this at the maintainers' meeting. TODO items:
If the container bind mounts part of the host filesystem including the daemon root (e.g. `docker run -v /var:/hostvar`), every `docker cp` (or other operation involving the `/archive` API) leads to mounts inside the container being doubled, eventually leaving the container (or host) broken. Here is a simple reproducer:

```console
$ docker run --name mm -d -v /var:/hostvar busybox sleep 3d
73b50c2e626ad9378f429b20ba77355cf815bc9f846f19c173a0e62f57224ad3
$ docker exec mm wc -l /proc/self/mountinfo
86 /proc/self/mountinfo
$ docker cp mm:/etc/group /dev/null
$ docker exec mm wc -l /proc/self/mountinfo
185 /proc/self/mountinfo
$ docker cp mm:/etc/group /dev/null
$ docker cp mm:/etc/group /dev/null
$ docker cp mm:/etc/group /dev/null
$ docker cp mm:/etc/group /dev/null
$ docker cp mm:/etc/group /dev/null
$ docker exec mm wc -l /proc/self/mountinfo
6323 /proc/self/mountinfo
$ docker cp mm:/etc/group /dev/null
$ docker cp mm:/etc/group /dev/null
$ docker cp mm:/etc/group /dev/null
Error response from daemon: mount /:/var/lib/docker/overlay2/c9dbd9463b6c972fa712132d3177cfc19c808ed3e0dcd9a208f7ad487ad40a40/merged/rootfs, flags: 0x5000: no space left on device
$ docker exec mm wc -l /proc/self/mountinfo
6323 /proc/self/mountinfo
```

Here, ENOSPC happens because (since Linux kernel 4.9) the number of mounts per mount namespace is limited (by the `fs.mount-max` sysctl); the default limit is 100000. The situation can be worse when there is another mount namespace on the host (such as one created by a systemd unit file with `PrivateTmp=true`): this leads to a kind of vicious circle, in which all the mounts from the container are multiplied in that other mount namespace, eventually leading to kernel mount table exhaustion, making any mount return ENOSPC (aka "can't mount: no space left on device").
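Incidentally, the numbers above follow a near-exact doubling: each `docker cp` takes the mount count m to roughly 2m + 13 (86 → 185 → ... → 6323 → ... → 50675). A quick sketch (my own arithmetic, not moby code) shows the default `fs.mount-max` of 100000 is exceeded after only about ten copies:

```shell
# Replay the growth observed in the reproducer: each copy roughly
# doubles the per-namespace mount count (m -> 2m + 13 here).
m=86     # initial mount count from the reproducer
ops=0
while [ "$m" -le 100000 ]; do   # 100000 = default fs.mount-max
    m=$((m * 2 + 13))
    ops=$((ops + 1))
done
echo "limit exceeded after $ops copies (mount count: $m)"
```

This matches the transcript: 6 copies give 6323 mounts, 9 give 50675, and around the 10th the mount syscall fails.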
A reproducer for the ntpd case is the same as above, with ntpd started beforehand (RHEL/CentOS 7):

```console
$ sudo yum install openntp
$ sudo systemctl start ntpd
... (repro from above)
$ sudo wc -l /proc/$(pidof ntpd)/mountinfo
```

To solve this crisis, make the container root (to which all the volumes will be mounted) unbindable, thus breaking the vicious circle. This would be a one-liner fix, but unfortunately, when the vfs graphdriver is used, the container root is not a mount, so calling `mount.MakeRUnbindable()` adds another (bind) mount, which needs to be unmounted once the copy operation is done. Let's do it in `(container *).DetachAndUnmount()`.

NOTE that this is more of a band-aid than a real fix; it is small, so it's easier to backport. A few alternative approaches to fix this could be:

1. For the purpose of `docker cp` (aka the `/archive` API entrypoint), mount volumes to a directory other than the container's root dir (and make that one unbindable).
2. Only mount those volumes that are required for this particular `/archive` request. This would solve the issue _unless_ some files under the `/hostvar` mount (as per the above example) are requested.
3. For the purposes of `docker cp`, don't bind mount anything that can result in a mount loop (i.e. any parent of the container root directory). This might break compatibility.
4. Somehow delegate getting/putting files from/to a container to a lower level (runc?).
5. Run dockerd in its own mount namespace (this would solve the ntpd issue above, but not the container's own issue).

All these approaches look way more verbose (in terms of patch size) and less backportable.

Signed-off-by: Kir Kolyshkin <[email protected]>
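For illustration, here is the essence of the fix in mount(8) terms; a hedged sketch of what `mount.MakeRUnbindable()` amounts to, not the actual Go code (the helper is only defined below, and actually running it would require root and a real container root path):

```shell
# Sketch: make a container root unbindable, so its submounts cannot be
# bind-mounted back into the container, breaking the mount loop.
make_runbindable() {
    dir="$1"
    # With the vfs graphdriver the container root is not a mount point,
    # so a bind mount must be added first; that extra mount then has to
    # be unmounted once the copy is done (cf. DetachAndUnmount).
    if ! mountpoint -q "$dir"; then
        mount --bind "$dir" "$dir"
    fi
    mount --make-runbindable "$dir"
}
```

An unbindable mount refuses any `mount --bind` of itself or its children, which is exactly what stops `/var/lib/docker/...` from being re-mounted into the container via the `/var` volume.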
Rebased; added regression test; improved commit description.
Could try to remove it automatically if `t.Verbose() == true`
I'm too lazy to do it :) But really, this would only be needed when debugging the test case code, and in that case the code is going to be changed anyway, with 99% or so probability.
Add a regression test for prev commit

Signed-off-by: Kir Kolyshkin <[email protected]>
Oh well, I have just opened another Pandora's box. Or is it a can of worms? Here's the simple and stupid script I wrote:

```sh
[root@kir-ce75-gd xe]# cat stress
set -e -u -o pipefail

NUMCTS=10
PARALLEL=2
DURATION=1m

rm -f .list .stop

echo "creating $NUMCTS containers"
for i in $(seq $NUMCTS); do
	docker run -d --network=none -v /var:/hostvar busybox sleep 1d >> .list
done

copy() {
	echo "keep copying..."
	while [ ! -f .stop ]; do
		shuf .list | parallel -j$PARALLEL -u docker cp '{}':/etc/group delete-me
	done
	echo "copying done"
}

restart() {
	echo "keep restarting..."
	while [ ! -f .stop ]; do
		shuf .list | parallel -j$PARALLEL -u docker restart -t 0 '{}'
	done
	echo "restarting done"
}

echo TESTING
restart &
copy &
sleep $DURATION

echo STOPPING
touch .stop
time wait

echo "removing"
cat .list | parallel -j$PARALLEL -u docker rm -f '{}'
rm .list .stop
```

Running it on moby git tip results in errors in both
and
Full(er) log
Looks like this needs to be investigated further...
I need to return to that one once I am finished with all the aufs and layer store optimizations saga. An alternative approach (such as not mounting volumes on
One more thing:
I think we could solve this in two ways:

Possibly minimal changes:

More involved change:
Also, I just found yesterday that the whole mounting/unmounting is performed twice: once for

In 99% of cases these mounts are not needed. I wonder if there's a way to mount these volumes conditionally/lazily (i.e. only when needed). This will probably require some ugly logic :(
Also, making the root mount
Current plan: use a directory other than the
Any progress? I'm being hit by this. Is there any workaround???
@rrebollo a workaround is to replace
@kolyshkin I tried the suggested workaround (the tar way, because I want to copy a dir), but after 38 MB I had the same result.
@kolyshkin is there any workaround for copying from the host to a docker container? I am hitting the same issue; with just 6-7 containers running, the issue reproduces. Increasing fs.mount-max (did it to see if this is actually the same issue) does provide temporary relief.
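For anyone poking at the sysctl mentioned above: the per-namespace limit can be read from procfs, and raising it (e.g. `sysctl -w fs.mount-max=200000` as root; value is my own example) only postpones the failure, since the mounts keep multiplying. A small sketch:

```shell
# Read the per-namespace mount limit (Linux >= 4.9); fall back to the
# documented default if this kernel does not expose the file.
limit=$(cat /proc/sys/fs/mount-max 2>/dev/null || echo 100000)
echo "fs.mount-max = $limit"
```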
@rajeshvig it's similar to the one I have above, just the other way. Something like

```sh
cat host_file | docker exec -i $CT sh -c 'cat > /path/to/container/file'
```

or, if you need many files,

```sh
tar cf - files* more_files/* | docker exec -i $CT tar -C dest_dir -xf -
```
Thanks, but it has a more serious issue associated with it: when the limit is breached, we cannot spawn any new containers on the host. In our case it was a cadvisor container, with mounts hitting 99997 within the container.
This is a naive first attempt at resolving the issue reported in moby#38995 and moby#43390. This approach does pass the integration tests locally, and it passes my minimal reproduction. It additionally passes the short stress-test of the daemon.

However, the attached integration test, while correct, does not properly fail without this patch. This is because it is not triggered by Docker-in-Docker -- to see this bug, we have to be in the 'true root' mount namespace.

There are a couple of other possibilities we can explore -- the first one I see is an [alternate test location]. Testing the daemon without the container abstraction would take this from a regression test for `docker cp` to a test of the soundness of mount propagation in the special case of mounting the daemon root into the container.

The second alternative is a completely different, larger-scale change: use a disposable mount namespace in the engine to do all container mounting in daemon/archive. This is more invasive and harder to prototype, but less likely to have issues in the future, as we can completely sidestep this category of issues (interactions with the root mount namespace).

Eyeballs and input on this approach, or on the merits of the other two approaches (as well as anything I haven't foreseen), are appreciated. I would like feedback from upstream (especially on how to test/validate this) before I go too far ahead with more revisions.

A prior attempt at solving this can be found at moby#38993.

[alternate test location]: https://github.com/moby/moby/blob/c55a4ac7795c7606b548b38e24673733481e2167/daemon/daemon_linux_test.go

Signed-off-by: Bjorn Neergaard <[email protected]>
Problem

`mountVolumes()` is called to temporarily mount volumes to the host, for the purpose of copying files to/from a container.

If the container bind mounts part of the host filesystem including the daemon root (e.g. `docker run -v /var:/hostvar`), every next `docker cp` (or other operation involving the `/archive` API) will lead to mounts inside the container being doubled, eventually leading to kernel mount table exhaustion, making any mount return ENOSPC (aka "can't mount: no space left on device").

Here is a complete reproducer:

Here, ENOSPC happens because (since Linux kernel 4.9) the number of mounts per mount namespace is limited (by the `fs.mount-max` sysctl); the default limit is 100000.

The situation can be worse in case there is another mount namespace on the host (such as one created by a systemd unit file with `PrivateTmp=true`); this leads to some kind of vicious circle, in which all the mounts from the container will be multiplied in that other mount namespace, eventually leading to kernel mount table exhaustion, making any mount return ENOSPC (aka "can't mount: no space left on device").

A reproducer for the ntpd case is the same as above, plus with ntpd started beforehand (RHEL/CentOS 7.x):

Solution

To solve this crisis, make the container root (to which all the volumes will be mounted) unbindable, thus breaking the vicious circle; but only do that if the container root is already a mount, so that we don't create any new bind mounts.

Testing

A regression test is added.

TODO: do some manual stress testing (parallel cp and start/stop)
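The "only if the container root is already a mount" condition from the solution can be probed like this; a naive `st_dev` heuristic for illustration only (moby's actual check parses mountinfo instead, which also catches bind mounts residing on the same filesystem):

```shell
# Naive mount-point check: a directory whose device ID differs from its
# parent's is (usually) a mount point; "/" is trivially one.
is_mountpoint() {
    dir="$1"
    [ "$dir" = "/" ] && return 0
    [ "$(stat -c %d "$dir")" != "$(stat -c %d "$dir/..")" ]
}

is_mountpoint / && echo "/ is a mount point"
```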