
[WIP] Fix mount loop on "docker cp" #38993

Closed
kolyshkin wants to merge 2 commits into moby:master from kolyshkin:mount-loop

Conversation

@kolyshkin
Contributor

@kolyshkin kolyshkin commented Apr 3, 2019

Problem

mountVolumes() is called to temporarily mount volumes to the host,
for the purpose of copying files to/from a container.
If the container bind-mounts part of the host filesystem that includes
the daemon root (e.g. docker run -v /var:/hostvar), every
subsequent docker cp (or other operation involving the /archive API)
doubles the mounts inside the container, eventually exhausting
the kernel mount table and making any mount
return ENOSPC (aka "can't mount: no space left on device").

Here is a complete reproducer:

```
$ docker run --name mm -d -v /:/rootfs busybox sleep 3d
73b50c2e626ad9378f429b20ba77355cf815bc9f846f19c173a0e62f57224ad3
$ docker exec mm wc -l /proc/self/mountinfo
86 /proc/self/mountinfo
$ docker cp mm:/etc/group /dev/null
$ docker exec mm wc -l /proc/self/mountinfo
185 /proc/self/mountinfo
$ docker cp mm:/etc/group /dev/null
$ docker cp mm:/etc/group /dev/null
$ docker cp mm:/etc/group /dev/null
$ docker cp mm:/etc/group /dev/null
$ docker cp mm:/etc/group /dev/null
$ docker exec mm wc -l /proc/self/mountinfo
6323 /proc/self/mountinfo
$ docker cp mm:/etc/group /dev/null
$ docker cp mm:/etc/group /dev/null
$ docker cp mm:/etc/group /dev/null
Error response from daemon: mount /:/var/lib/docker/overlay2/c9dbd9463b6c972fa712132d3177cfc19c808ed3e0dcd9a208f7ad487ad40a40/merged/rootfs, flags: 0x5000: no space left on device
$ docker exec mm wc -l /proc/self/mountinfo
50675 /proc/self/mountinfo
```

Here, ENOSPC happens because (since Linux kernel 4.9) the number of
mounts per mount namespace is limited (by the fs.mount-max sysctl);
the default limit is 100000.

The situation can be worse when there is another mount
namespace on the host (such as one created by a systemd unit
file with PrivateTmp=true). This creates a kind of vicious
circle, in which all the mounts from the container are
multiplied in that other mount namespace, eventually
leading to kernel mount table exhaustion, making any mount
return ENOSPC (aka "can't mount: no space left on device").

A reproducer for the ntpd case is the same as above, with ntpd
started beforehand (RHEL/CentOS 7.x):

```
$ sudo yum install ntp
$ sudo systemctl start ntpd
$ sudo wc -l /proc/$(pidof ntpd)/mountinfo
...(repro from above)
$ sudo wc -l /proc/$(pidof ntpd)/mountinfo
```

Solution

To solve this, make the container root (to which all the volumes
are mounted) unbindable, thus breaking the vicious circle; but
only do that if the container root is already a mount, so that we
don't create any new bind mounts.

Testing

A regression test is added.

TODO: do some manual stress testing (parallel cp and start/stop)

@kolyshkin kolyshkin changed the title [WIP] Mount loop fix [WIP] "docker cp" mount loop fix Apr 3, 2019
@codecov
Copy link

codecov bot commented Apr 3, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@f7ec606).
The diff coverage is 0%.

```
@@            Coverage Diff            @@
##             master   #38993   +/-   ##
=========================================
  Coverage          ?      37%           
=========================================
  Files             ?      612           
  Lines             ?    45393           
  Branches          ?        0           
=========================================
  Hits              ?    16796           
  Misses            ?    26318           
  Partials          ?     2279
```

@kolyshkin
Contributor Author

Got a number of errors like

```
01:03:18.695 Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:424: container init caused "rootfs_linux.go:46: preparing rootfs caused \"invalid argument\""": unknown
```

coming from runc.

Looking into it.

@kolyshkin
Contributor Author

Unfortunately this approach is not working well with some setups; I'll try an alternative way.

@kolyshkin kolyshkin force-pushed the mount-loop branch 3 times, most recently from 8cff273 to 416e467 Compare April 9, 2019 21:01
@thaJeztah
Member

We just bumped containerd and runc, is it worth checking if the new versions don't have this problem in combination with your patch?

@kolyshkin
Contributor Author

> We just bumped containerd and runc, is it worth checking if the new versions don't have this problem in combination with your patch?

@thaJeztah Most probably not, but I rebased it just in case.

@thaJeztah
Member

> @thaJeztah Most probably not, but I rebased it just in case.

Well.. CI is green 😉 (I recalled there were some issues fixed around groups not being found under certain conditions, and an SELinux issue fixed, which also produced a similar error)

Not sure if CI is hitting the problem you were talking about though 😅

@thaJeztah thaJeztah changed the title [WIP] "docker cp" mount loop fix [WIP] Fix mount loop on "docker cp" Apr 10, 2019
@thaJeztah
Member

@kolyshkin do you think it's possible to write a test for this, or would that be highly flaky (if other containers are running, counting mounts could be affected by them)?

@thaJeztah
Member

/cc @cpuguy83 @AkihiroSuda

@thaJeztah thaJeztah changed the title [WIP] Fix mount loop on "docker cp" Fix mount loop on "docker cp" Apr 11, 2019
@kolyshkin
Contributor Author

Discussed this at the maintainers' meeting.

TODO items:

  • add a regression test to integration;
  • run some manual stress testing (docker cp and start/stop in parallel) with different graph drivers;
  • amend the comments to mention that this depends on the container being locked for the duration of the docker cp operation;
  • amend the description to say that
    • this is a quick fix which is back-portable;
    • what would be a good long-term fix.

If the container bind-mounts part of the host filesystem including
the daemon root (e.g. `docker run -v /var:/hostvar`), every
`docker cp` (or other operation involving the /archive API)
doubles the mounts inside the container,
eventually leading to a broken container (or host).

Here is a simple reproducer:

```
$ docker run --name mm -d -v /var:/hostvar busybox sleep 3d
73b50c2e626ad9378f429b20ba77355cf815bc9f846f19c173a0e62f57224ad3
$ docker exec mm wc -l /proc/self/mountinfo
86 /proc/self/mountinfo
$ docker cp mm:/etc/group /dev/null
$ docker exec mm wc -l /proc/self/mountinfo
185 /proc/self/mountinfo
$ docker cp mm:/etc/group /dev/null
$ docker cp mm:/etc/group /dev/null
$ docker cp mm:/etc/group /dev/null
$ docker cp mm:/etc/group /dev/null
$ docker cp mm:/etc/group /dev/null
$ docker exec mm wc -l /proc/self/mountinfo
6323 /proc/self/mountinfo
$ docker cp mm:/etc/group /dev/null
$ docker cp mm:/etc/group /dev/null
$ docker cp mm:/etc/group /dev/null
Error response from daemon: mount /:/var/lib/docker/overlay2/c9dbd9463b6c972fa712132d3177cfc19c808ed3e0dcd9a208f7ad487ad40a40/merged/rootfs, flags: 0x5000: no space left on device
$ docker exec mm wc -l /proc/self/mountinfo
50675 /proc/self/mountinfo
```

Here, ENOSPC happens because (since Linux kernel 4.9) the number of
mounts per mount namespace is limited (by the fs.mount-max sysctl);
the default limit is 100000.

The situation can be worse when there is another mount
namespace on the host (such as one created by a systemd unit
file with PrivateTmp=true). This creates a kind of vicious
circle, in which all the mounts from the container are
multiplied in that other mount namespace, eventually
leading to kernel mount table exhaustion, making any mount
return ENOSPC (aka "can't mount: no space left on device").

A reproducer for the ntpd case is the same as above, with ntpd
started beforehand (Red Hat 7):

```
$ sudo yum install ntp
$ sudo systemctl start ntpd
...(repro from above)
$ sudo wc -l /proc/$(pidof ntpd)/mountinfo
```

To solve this, make the container root (to which all the volumes
are mounted) unbindable, thus breaking the vicious circle.

This would be a one-liner fix, but unfortunately, when
the vfs graphdriver is used, the container root is not a mount,
so calling mount.MakeRUnbindable() adds another (bind) mount
which needs to be unmounted once the copy operation is done.
Let's do it in (container *).DetachAndUnmount().

NOTE that this is more like a bandaid than a real fix; it is
small so it's easier to backport.

A few alternative approaches to fix this could be:

1. For the purpose of `docker cp` (aka /archive API entrypoint),
mount volumes to a directory other than container's root dir
(and make that one unbindable).

2. Only mount those volumes that are required for the purpose
of this particular /archive request. This would solve the issue
_unless_ some files under /hostvar mount (as per above example) are
requested.

3. For the purposes of `docker cp`, don't bind mount anything that
can result in a mount loop (i.e. any parent of the container root
directory). This might break compatibility.

4. Somehow delegate getting/putting files from/to a container
to lower levels (runc?).

5. Run dockerd in its own mount namespace (this would solve
the ntpd issue above, but not the container's own issue).

All these approaches look way more verbose (in terms of patch
size) and less backport-able.

Signed-off-by: Kir Kolyshkin <[email protected]>
@kolyshkin kolyshkin force-pushed the mount-loop branch 2 times, most recently from 6ace12c to 76fbe26 Compare April 11, 2019 23:25
@kolyshkin
Contributor Author

Rebased; added regression test; improved commit description.

Member

Could try to remove it automatically if `t.Verbose() == true`

Contributor Author

I'm too lazy to do it :) But really, this would only be needed when debugging the test case code, and in that case it is going to be changed anyway, with 99% or so probability.

Add a regression test for prev commit

Signed-off-by: Kir Kolyshkin <[email protected]>
@kolyshkin
Contributor Author

kolyshkin commented Apr 12, 2019

> run some manual stress testing (docker cp and start/stop in parallel) with different graph drivers;

Oh well, I have just opened another Pandora's box. Or is it a can of worms?

Here's the simple and stupid script I wrote:

```
[root@kir-ce75-gd xe]# cat stress
set -e -u -o pipefail
NUMCTS=10
PARALLEL=2
DURATION=1m

rm -f .list .stop

echo "creating $NUMCTS containers"
for i in $(seq $NUMCTS); do
	docker run -d --network=none -v /var:/hostvar busybox sleep 1d >> .list
done

copy() {
	echo "keep copying..."
	while [ ! -f .stop ]; do
		shuf .list | parallel -j$PARALLEL -u docker cp '{}':/etc/group delete-me
	done
	echo "copying done"
}

restart() {
	echo "keep restarting..."
	while [ ! -f .stop ]; do
		shuf .list | parallel -j$PARALLEL -u docker restart -t 0 '{}'
	done
	echo "restarting done"
}

echo TESTING
restart &
copy &

sleep $DURATION

echo STOPPING
touch .stop
time wait

echo "removing"
cat .list | parallel -j$PARALLEL -u docker rm -f '{}'
rm .list .stop
```

Running it on moby git tip results in errors in both restart and cp. This happens with or without the commits from this PR 😕. The two kinds of errors are:

```
Error: No such container:path: 42e32d8eefe384ad396d52c3c7cebc91067deffc0dac28a00f76f37f38abe9fe:/etc/group
```

and

```
Error response from daemon: Cannot restart container 7c21202c32ad034f7ce2734d7efc4288ebf265ec679b939b0040a2397fab62e0: OCI runtime create failed: container_linux.go:344: starting container process caused "exec: \"sleep\": executable file not found in $PATH": unknown
```

Full(er) log

```
# bash -x ./stress
+ set -e -u -o pipefail
+ NUMCTS=10
+ PARALLEL=2
+ DURATION=1m
+ rm -f .list .stop
+ echo 'creating 10 containers'
creating 10 containers
++ seq 10
+ for i in '$(seq $NUMCTS)'
+ docker run -d --network=none -v /var:/hostvar busybox sleep 1d
+ for i in '$(seq $NUMCTS)'
+ docker run -d --network=none -v /var:/hostvar busybox sleep 1d
+ for i in '$(seq $NUMCTS)'
+ docker run -d --network=none -v /var:/hostvar busybox sleep 1d
+ for i in '$(seq $NUMCTS)'
+ docker run -d --network=none -v /var:/hostvar busybox sleep 1d
+ for i in '$(seq $NUMCTS)'
+ docker run -d --network=none -v /var:/hostvar busybox sleep 1d
+ for i in '$(seq $NUMCTS)'
+ docker run -d --network=none -v /var:/hostvar busybox sleep 1d
+ for i in '$(seq $NUMCTS)'
+ docker run -d --network=none -v /var:/hostvar busybox sleep 1d
+ for i in '$(seq $NUMCTS)'
+ docker run -d --network=none -v /var:/hostvar busybox sleep 1d
+ for i in '$(seq $NUMCTS)'
+ docker run -d --network=none -v /var:/hostvar busybox sleep 1d
+ for i in '$(seq $NUMCTS)'
+ docker run -d --network=none -v /var:/hostvar busybox sleep 1d
+ echo TESTING
TESTING
+ restart
+ echo 'keep restarting...'
keep restarting...
+ '[' '!' -f .stop ']'
+ sleep 1m
+ copy
+ echo 'keep copying...'
keep copying...
+ '[' '!' -f .stop ']'
+ parallel -j2 -u docker restart -t 0 '{}'
+ shuf .list
+ parallel -j2 -u docker cp '{}:/etc/group' delete-me
+ shuf .list
+ '[' '!' -f .stop ']'
+ shuf .list
+ parallel -j2 -u docker cp '{}:/etc/group' delete-me
Error: No such container:path: a0742f9ab8a9e3733cff1e3c4c7a6fbde5b89df431514e995025f02072eeb3c8:/etc/group
Error: No such container:path: 42e32d8eefe384ad396d52c3c7cebc91067deffc0dac28a00f76f37f38abe9fe:/etc/group
Error response from daemon: Cannot restart container 7c21202c32ad034f7ce2734d7efc4288ebf265ec679b939b0040a2397fab62e0: OCI runtime create failed: container_linux.go:344: starting container process caused "exec: \"sleep\": executable file not found in $PATH": unknown
Error: No such container:path: eb73d8a89c55a3edd4c1930f44a7c7c998d2b137f3d619a4fd1faeb5fc32c62b:/etc/group
Error: No such container:path: 66849f7a28985ed9c2b1d44f82b0b8cea8f562cab68a364c4868606497eb699d:/etc/group
Error response from daemon: Cannot restart container 5ae6caaae389d983e5a4c1b9f3ba1597da23c2ed709770544bd1bbf2d6916b06: OCI runtime create failed: container_linux.go:344: starting container process caused "exec: \"sleep\": executable file not found in $PATH": unknown
Error: No such container:path: 53f01cc4289c7b83895bb96af3eb0f63b123fd86b2eff9e8e7c9e71ed349ae9d:/etc/group
Error: No such container:path: 335ac745c4c059f1bbf3a6424229e6768dd57a4b1abf524d0ff60f66e1a2db9e:/etc/group
Error: No such container:path: 85ea713dddf25ea4be83b071b6635a66e3d287e728e28b9c90e4fe9bf23c97f0:/etc/group
Error: No such container:path: 65c4021a3880a09001a1c2d6f8293c62fe12ea70d2a456cb11dd899b22555a3c:/etc/group
Error response from daemon: Cannot restart container eb73d8a89c55a3edd4c1930f44a7c7c998d2b137f3d619a4fd1faeb5fc32c62b: OCI runtime create failed: container_linux.go:344: starting container process caused "exec: \"sleep\": executable file not found in $PATH": unknown
Error response from daemon: Cannot restart container 53f01cc4289c7b83895bb96af3eb0f63b123fd86b2eff9e8e7c9e71ed349ae9d: OCI runtime create failed: container_linux.go:344: starting container process caused "exec: \"sleep\": executable file not found in $PATH": unknown
Error response from daemon: Cannot restart container 42e32d8eefe384ad396d52c3c7cebc91067deffc0dac28a00f76f37f38abe9fe: OCI runtime create failed: container_linux.go:344: starting container process caused "exec: \"sleep\": executable file not found in $PATH": unknown
Error response from daemon: Cannot restart container 85ea713dddf25ea4be83b071b6635a66e3d287e728e28b9c90e4fe9bf23c97f0: OCI runtime create failed: container_linux.go:344: starting container process caused "exec: \"sleep\": executable file not found in $PATH": unknown
Error response from daemon: Cannot restart container a0742f9ab8a9e3733cff1e3c4c7a6fbde5b89df431514e995025f02072eeb3c8: OCI runtime create failed: container_linux.go:344: starting container process caused "exec: \"sleep\": executable file not found in $PATH": unknown
....
```

Looks like this needs to be investigated further...

@thaJeztah thaJeztah changed the title Fix mount loop on "docker cp" [WIP] Fix mount loop on "docker cp" Apr 18, 2019
@kolyshkin
Contributor Author

I need to return to this one once I am finished with the whole aufs and layer-store optimization saga. An alternative approach (such as not mounting volumes on docker cp unless necessary) might be the way to go.

@kolyshkin
Contributor Author

One more thing: systemd-udevd "collects" all these one-time mounts in its mount namespace (see /proc/$(pidof systemd-udevd)/mountinfo).

@cpuguy83
Member

I think we could solve this in two ways:

Possibly minimal changes:

  • Drop to a new mount namespace before mounting all the things.
  • This requires locking the goroutine to an OS thread, and the OS thread will need to be thrown away afterwards (which I believe it does now, whereas before it would be re-used).
  • This is kind of a quick and dirty way to do it but, at least in theory, it should work.

More involved change:

  • Change the graphdriver interface to return a set of mounts to be performed rather than actually performing the mounts.
  • Change the same for mountVolumes.
  • Invoke these mounts in a new mount namespace, possibly even in a new process.

@kolyshkin
Contributor Author

Also, I just found yesterday that the whole mounting/unmounting is performed twice: once for HEAD (which checks the file is there) and once for PUT (of /containers/ID/archive) -- at least this is how docker cp works when copying into a container (and when copying from a container, if followSymlinks is set).

In 99% of cases these mounts are not needed. I wonder if there's a way to mount these volumes conditionally/lazily (i.e. when needed). This will probably require some ugly logic :(

@kolyshkin
Contributor Author

Also, making the root mount unbindable should work, but it looks like it breaks some tests; I need to figure out why exactly this is happening (I suspect there might be some other bugs hiding down there).

@kolyshkin
Contributor Author

Current plan: use a directory other than the ContainerRoot for docker cp mounts, and make it recursively unbindable. This should fix the issue without breaking stuff left and right.

@rrebollo

Any progress? I'm being hit by this. Is there any workaround???

@kolyshkin
Contributor Author

@rrebollo a workaround is to replace docker cp with

```
docker exec $CT cat /path/to/container/file > host_file
```

or, if you need many files,

```
docker exec $CT tar cf - /some/container/dir | tar xf -
```

@rrebollo

rrebollo commented Dec 9, 2019

@kolyshkin I tried the suggested workaround (the tar way, because I want to copy a dir) but after 38 MB I had the same result.

@rajeshvig

@kolyshkin is there any workaround for copying from the host to a docker container? I am hitting the same issue; with just 6-7 containers running, the issue reproduces. Increasing fs.mount-max (I did it to see if this is actually the same issue) does provide temporary relief.

@kolyshkin
Contributor Author

@rajeshvig it's similar to the one I have above, just the other way around. Something like

```
docker exec -i $CT sh -c 'cat > /path/to/container/file' < host_file
```

or, if you need many files,

```
tar cf - files* more_files/* | docker exec -i $CT tar -C dest_dir -xf -
```

@rajeshvig

Thanks, but it has a more serious issue associated with it. When the limit is breached, we cannot spawn any new containers on the host. In our case it was a cadvisor container with mounts hitting 99997 within the container.

neersighted added a commit to neersighted/moby that referenced this pull request May 11, 2022
This is a naive first attempt at resolving the issue reported in moby#38995
and moby#43390. This approach does pass the integration tests locally, and
it passes my minimal reproduction. It additionally passes the short
stress-test of the daemon.

However, the attached integration test, while correct, does not
properly fail without this patch. This is because it is not triggered
by Docker-in-Docker -- to see this bug, we have to be in the 'true
root' mount namespace.

There are a couple other possibilities we can explore -- the first one I
see is an [alternate test location]. Testing the daemon without the
container abstraction would take this from a regression test for `docker
cp` to a test of the soundness of mount propagation during the special
case of mounting the daemon root into the container.

The second alternative is a completely different, larger-scale change,
to use a disposable mount namespace in the engine to do all container
mounting in daemon/archive. This is more invasive and harder to
prototype, but less likely to have issues in the future as we can
completely sidestep this category of issues (interactions with the root
mount namespace).

Eyeballs and input on this approach, or the merits of the other two
approaches (as well as anything I haven't foreseen), are appreciated. I
would like feedback from upstream (especially on how to test/validate
this) before I go too far ahead with more revisions.

A prior attempt at solving this can be found at moby#38993.

  [alternate test location]: https://github.com/moby/moby/blob/c55a4ac7795c7606b548b38e24673733481e2167/daemon/daemon_linux_test.go

Signed-off-by: Bjorn Neergaard <[email protected]>
neersighted added a commit to neersighted/moby that referenced this pull request May 12, 2022