Conversation
|
ae157b8 to
6dcf038
Compare
Codecov Report
@@ Coverage Diff @@
## master #39107 +/- ##
=========================================
Coverage ? 37.05%
=========================================
Files ? 612
Lines ? 45507
Branches ? 0
=========================================
Hits ? 16861
Misses ? 26353
Partials ? 2293 |
cpuguy83
left a comment
There was a problem hiding this comment.
I'm pretty sure I tried to remove this lock before and ran into trouble.
@cpuguy83 yeah, it took me quite a while (days) to figure out what the proper locking should be (and I have yet to update this PR with my new patchset). Apparently, parallel unmounts is fine, but there's a race in aufs kernel module that prevents parallel mounts, so this has to be done under a lock. Will try to bring this back to life today. |
|
OK, I have updated the patch set and did some benchmarks (on top of #39135, using the script from there). Parallel container creation is now twice faster, other timings are similar. |
|
@kolyshkin build is failing; |
|
Not sure if this is relevant (error on exp CI):
|
|
perhaps the name generator wasn't random enough? Interesting; never encountered that |
|
janky CI is flaky test (#32673)
|
|
adding a temp commit to stress-test TestService (run them many times) |
Both mount and unmount calls are already protected by fine-grained (per id) locks in Get()/Put() introduced in commit fc1cf19 ("Add more locking to storage drivers"), so there's no point in having a global lock in mount/unmount. The only place from which unmount is called without any locking is Cleanup() -- this is to be addressed in the next patch. This reverts commit 824c24e. Signed-off-by: Kir Kolyshkin <[email protected]>
1. Use mount.Unmount() which ignores EINVAL ("not mounted") error,
and provides better error diagnostics (so we don't have to explicitly
add target to error messages).
2. Since we're ignoring "not mounted" error, we can call
multiple unmounts without any locking -- but since "auplink flush"
is still involved and can produce an error in logs, let's keep
the check for fs being mounted (it's just a statfs so should be fast).
2. While at it, improve the "can't unmount" error message in Put().
Signed-off-by: Kir Kolyshkin <[email protected]>
Signed-off-by: Kir Kolyshkin <[email protected]>
Apparently there is some kind of race in aufs kernel module code, which leads to the errors like: [98221.158606] aufs au_xino_create2:186:dockerd[25801]: aufs.xino create err -17 [98221.162128] aufs au_xino_set:1229:dockerd[25801]: I/O Error, failed creating xino(-17). [98362.239085] aufs au_xino_create2:186:dockerd[6348]: aufs.xino create err -17 [98362.243860] aufs au_xino_set:1229:dockerd[6348]: I/O Error, failed creating xino(-17). [98373.775380] aufs au_xino_create:767:dockerd[27435]: open /dev/shm/aufs.xino(-17) [98389.015640] aufs au_xino_create2:186:dockerd[26753]: aufs.xino create err -17 [98389.018776] aufs au_xino_set:1229:dockerd[26753]: I/O Error, failed creating xino(-17). [98424.117584] aufs au_xino_create:767:dockerd[27105]: open /dev/shm/aufs.xino(-17) So, we have to have a lock around mount syscall. While at it, don't call the whole Unmount() on an error path, as it leads to bogus error from auplink flush. Signed-off-by: Kir Kolyshkin <[email protected]>
In case there are a big number of layers, so that mount data won't fit
into a single memory page (4096 bytes on most platforms, which is good
enough for about 40 layers, depending on how long graphdriver root path
is), we supply additional layers with O_REMOUNT, as described in aufs
documentation.
Problem is, the current implementation does that one layer at a time
(i.e. there is one mount syscall per each additional layer).
Optimize the code to supply as many layers as we can fit in one page
(basically reusing the same code as for the original mount).
Note, per aufs docs, "[a]t remount-time, the options are interpreted
in the given order, e.g. left to right" so we should be good.
Tested on an image with ~100 layers.
Before (35 syscalls):
> [pid 22756] 1556919088.686955 mount("none", "/mnt/volume_sfo2_09/docker-aufs/aufs/mnt/a86f8c9dd0ec2486293119c20b0ec026e19bbc4d51332c554f7cf05d777c9866", "aufs", 0, "br:/mnt/volume_sfo2_09/docker-au"...) = 0 <0.000504>
> [pid 22756] 1556919088.687643 mount("none", "/mnt/volume_sfo2_09/docker-aufs/aufs/mnt/a86f8c9dd0ec2486293119c20b0ec026e19bbc4d51332c554f7cf05d777c9866", 0xc000c451b0, MS_REMOUNT, "append:/mnt/volume_sfo2_09/docke"...) = 0 <0.000105>
> [pid 22756] 1556919088.687851 mount("none", "/mnt/volume_sfo2_09/docker-aufs/aufs/mnt/a86f8c9dd0ec2486293119c20b0ec026e19bbc4d51332c554f7cf05d777c9866", 0xc000c451ba, MS_REMOUNT, "append:/mnt/volume_sfo2_09/docke"...) = 0 <0.000098>
> ..... (~30 lines skipped for clarity)
> [pid 22756] 1556919088.696182 mount("none", "/mnt/volume_sfo2_09/docker-aufs/aufs/mnt/a86f8c9dd0ec2486293119c20b0ec026e19bbc4d51332c554f7cf05d777c9866", 0xc000c45310, MS_REMOUNT, "append:/mnt/volume_sfo2_09/docke"...) = 0 <0.000266>
After (2 syscalls):
> [pid 24352] 1556919361.799889 mount("none", "/mnt/volume_sfo2_09/docker-aufs/aufs/mnt/8e7ba189e347a834e99eea4ed568f95b86cec809c227516afdc7c70286ff9a20", "aufs", 0, "br:/mnt/volume_sfo2_09/docker-au"...) = 0 <0.001717>
> [pid 24352] 1556919361.801761 mount("none", "/mnt/volume_sfo2_09/docker-aufs/aufs/mnt/8e7ba189e347a834e99eea4ed568f95b86cec809c227516afdc7c70286ff9a20", 0xc000dbecb0, MS_REMOUNT, "append:/mnt/volume_sfo2_09/docke"...) = 0 <0.001358>
Signed-off-by: Kir Kolyshkin <[email protected]>
Do not use filepath.Walk() as there's no requirement to recursively go into every directory under mnt -- a (non-recursive) list of directories in mnt is sufficient. With filepath.Walk(), in case some container will fail to unmount, it'll go through the whole container filesystem which is both excessive and useless. This is similar to commit f1a4592 ("devmapper.shutdown: optimize") While at it, raise the priority of "unmount error" message from debug to a warning. Note we don't have to explicitly add `m` as unmount error (from pkg/mount) will have it. Signed-off-by: Kir Kolyshkin <[email protected]>
Running a bundled aufs benchmark sometimes results in this warning:
> WARN[0001] Couldn't run auplink before unmount /tmp/aufs-tests/aufs/mnt/XXXXX error="exit status 22" storage-driver=aufs
If we take a look at what aulink utility produces on stderr, we'll see:
> auplink:proc_mnt.c:96: /tmp/aufs-tests/aufs/mnt/XXXXX: Invalid argument
and auplink exits with exit code of 22 (EINVAL).
Looking into auplink source code, what happens is it tries to find a
record in /proc/self/mounts corresponding to the mount point (by using
setmntent()/getmntent_r() glibc functions), and it fails.
Some manual testing, as well as runtime testing with lots of printf
added on mount/unmount, as well as calls to check the superblock fs
magic on mount point (as in graphdriver.Mounted(graphdriver.FsMagicAufs, target)
confirmed that this record is in fact there, but sometimes auplink
can't find it. I was also able to reproduce the same error (inability
to find a mount in /proc/self/mounts that should definitely be there)
using a small C program, mocking what `auplink` does:
```c
#include <stdio.h>
#include <err.h>
#include <mntent.h>
#include <string.h>
#include <stdlib.h>
int main(int argc, char **argv)
{
FILE *fp;
struct mntent m, *p;
char a[4096];
char buf[4096 + 1024];
int found =0, lines = 0;
if (argc != 2) {
fprintf(stderr, "Usage: %s <mountpoint>\n", argv[0]);
exit(1);
}
fp = setmntent("/proc/self/mounts", "r");
if (!fp) {
err(1, "setmntent");
}
setvbuf(fp, a, _IOLBF, sizeof(a));
while ((p = getmntent_r(fp, &m, buf, sizeof(buf)))) {
lines++;
if (!strcmp(p->mnt_dir, argv[1])) {
found++;
}
}
printf("found %d entries for %s (%d lines seen)\n", found, argv[1], lines);
return !found;
}
```
I have also wrote a few other C proggies -- one that reads
/proc/self/mounts directly, one that reads /proc/self/mountinfo instead.
They are also prone to the same occasional error.
It is not perfectly clear why this happens, but so far my best theory
is when a lot of mounts/unmounts happen in parallel with reading
contents of /proc/self/mounts, sometimes the kernel fails to provide
continuity (i.e. it skips some part of file or mixes it up in some
other way). In other words, this is a kernel bug (which is probably
hard to fix unless some other interface to get a mount entry is added).
Now, there is no real fix, and a workaround I was able to come up
with is to retry when we got EINVAL. It usually works on the second
attempt, although I've once seen it took two attempts to go through.
Signed-off-by: Kir Kolyshkin <[email protected]>
|
Rebased to include #39243 |
andrewhsu
left a comment
There was a problem hiding this comment.
SGTM w.r.t. getting perf optimizations for aufs; much desirable
|
OK, the test I wrote (https://github.com/kolyshkin/procfs-test) convinced aufs author to include my patch (that retries reading of /proc/self/mounts in case the mount can't be found). Nevertheless it makes sense to have a retry here as well, since we don't ship |
|
OK, I have finally figured out why two aufs mounts might race (and so I had to add a lock around Aside from using lock around mount, which is what this PR does, there are other ways to work around this.
I'm going to go with approach 1. |
|
I did something like this: diff --git a/daemon/graphdriver/aufs/aufs.go b/daemon/graphdriver/aufs/aufs.go
index 9c4e28e..2fd8c8a 100644
--- a/daemon/graphdriver/aufs/aufs.go
+++ b/daemon/graphdriver/aufs/aufs.go
@@ -28,10 +28,12 @@ import (
"fmt"
"io"
"io/ioutil"
+ "math/rand"
"os"
"os/exec"
"path"
"path/filepath"
+ "strconv"
"strings"
"sync"
"time"
@@ -80,7 +82,6 @@ type Driver struct {
pathCache map[string]string
naiveDiff graphdriver.DiffDriver
locker *locker.Locker
- mntL sync.Mutex
}
// Init returns a new AUFS driver.
@@ -618,14 +619,13 @@ func (a *Driver) aufsMount(ro []string, rw, target, mountLabel string) (err erro
bp += copy(b[bp:], layer)
}
- opts := "dio,xino=/dev/shm/aufs.xino"
+ rnd := strconv.FormatInt(int64(1e9+rand.Uint32()%1e9), 36)[1:5]
+ opts := "dio,xino=/dev/shm/aufs."+rnd
if useDirperm() {
opts += ",dirperm1"
}
data := label.FormatMountLabel(fmt.Sprintf("%s,%s", string(b[:bp]), opts), mountLabel)
- a.mntL.Lock()
err = unix.Mount("none", target, "aufs", 0, data)
- a.mntL.Unlock()
if err != nil {
err = errors.Wrap(err, "mount target="+target+" data="+data)
return
@@ -641,9 +641,7 @@ func (a *Driver) aufsMount(ro []string, rw, target, mountLabel string) (err erro
bp += copy(b[bp:], layer)
}
data := label.FormatMountLabel(string(b[:bp]), mountLabel)
- a.mntL.Lock()
err = unix.Mount("none", target, "aufs", unix.MS_REMOUNT, data)
- a.mntL.Unlock()
if err != nil {
err = errors.Wrap(err, "mount target="+target+" flags=MS_REMOUNT data="+data)
returnbut it didn't change anything from the performance point of view. Not sure which way is better. |
[18.09 backport ENGCORE-830] aufs optimizations moby#39107
[19.03 backport ENGCORE-831] aufs optimizations moby#39107
Fixes CVE-2018-15664: API endpoints behind the 'docker cp' command are vulnerable to a symlink-exchange attack with Directory Traversal, giving attackers arbitrary read-write access to the host filesystem with root privileges, because daemon/archive.go does not do archive operations on a frozen filesystem (or from within a chroot). And includes additional post-18.09.6 fixes: Builder - Fixed a panic error when building dockerfiles that contain only comments. moby/moby#38487 - Added a workaround for GCR authentication issue. moby/moby#38246 - Builder-next: Fixed a bug in the GCR token cache implementation workaround. moby/moby#39183 Runtime - Added performance optimizations in aufs and layer store that helps in massively parallel container creation and removal. moby/moby#39107, moby/moby#39135 - daemon: fixed a mirrors validation issue. moby/moby#38991 - Docker no longer supports sorting UID and GID ranges in ID maps. moby/moby#39288 Logging - Added a fix that now allows large log lines for logger plugins. moby/moby#39038 Signed-off-by: Peter Korsgaard <[email protected]> Signed-off-by: Arnout Vandecappelle (Essensium/Mind) <[email protected]>
Fixes CVE-2018-15664: API endpoints behind the 'docker cp' command are vulnerable to a symlink-exchange attack with Directory Traversal, giving attackers arbitrary read-write access to the host filesystem with root privileges, because daemon/archive.go does not do archive operations on a frozen filesystem (or from within a chroot). And includes additional post-18.09.6 fixes: Builder - Fixed a panic error when building dockerfiles that contain only comments. moby/moby#38487 - Added a workaround for GCR authentication issue. moby/moby#38246 - Builder-next: Fixed a bug in the GCR token cache implementation workaround. moby/moby#39183 Runtime - Added performance optimizations in aufs and layer store that helps in massively parallel container creation and removal. moby/moby#39107, moby/moby#39135 - daemon: fixed a mirrors validation issue. moby/moby#38991 - Docker no longer supports sorting UID and GID ranges in ID maps. moby/moby#39288 Logging - Added a fix that now allows large log lines for logger plugins. moby/moby#39038 Signed-off-by: Peter Korsgaard <[email protected]> Signed-off-by: Arnout Vandecappelle (Essensium/Mind) <[email protected]>
Fixes CVE-2018-15664: API endpoints behind the 'docker cp' command are vulnerable to a symlink-exchange attack with Directory Traversal, giving attackers arbitrary read-write access to the host filesystem with root privileges, because daemon/archive.go does not do archive operations on a frozen filesystem (or from within a chroot). And includes additional post-18.09.6 fixes: Builder - Fixed a panic error when building dockerfiles that contain only comments. moby/moby#38487 - Added a workaround for GCR authentication issue. moby/moby#38246 - Builder-next: Fixed a bug in the GCR token cache implementation workaround. moby/moby#39183 Runtime - Added performance optimizations in aufs and layer store that helps in massively parallel container creation and removal. moby/moby#39107, moby/moby#39135 - daemon: fixed a mirrors validation issue. moby/moby#38991 - Docker no longer supports sorting UID and GID ranges in ID maps. moby/moby#39288 Logging - Added a fix that now allows large log lines for logger plugins. moby/moby#39038 Signed-off-by: Peter Korsgaard <[email protected]> Signed-off-by: Arnout Vandecappelle (Essensium/Mind) <[email protected]>
Fixes CVE-2018-15664: API endpoints behind the 'docker cp' command are vulnerable to a symlink-exchange attack with Directory Traversal, giving attackers arbitrary read-write access to the host filesystem with root privileges, because daemon/archive.go does not do archive operations on a frozen filesystem (or from within a chroot). And includes additional post-18.09.6 fixes: Builder - Fixed a panic error when building dockerfiles that contain only comments. moby/moby#38487 - Added a workaround for GCR authentication issue. moby/moby#38246 - Builder-next: Fixed a bug in the GCR token cache implementation workaround. moby/moby#39183 Runtime - Added performance optimizations in aufs and layer store that helps in massively parallel container creation and removal. moby/moby#39107, moby/moby#39135 - daemon: fixed a mirrors validation issue. moby/moby#38991 - Docker no longer supports sorting UID and GID ranges in ID maps. moby/moby#39288 Logging - Added a fix that now allows large log lines for logger plugins. moby/moby#39038 Signed-off-by: Peter Korsgaard <[email protected]> Signed-off-by: Arnout Vandecappelle (Essensium/Mind) <[email protected]>
Fixes CVE-2018-15664: API endpoints behind the 'docker cp' command are vulnerable to a symlink-exchange attack with Directory Traversal, giving attackers arbitrary read-write access to the host filesystem with root privileges, because daemon/archive.go does not do archive operations on a frozen filesystem (or from within a chroot). And includes additional post-18.09.6 fixes: Builder - Fixed a panic error when building dockerfiles that contain only comments. moby/moby#38487 - Added a workaround for GCR authentication issue. moby/moby#38246 - Builder-next: Fixed a bug in the GCR token cache implementation workaround. moby/moby#39183 Runtime - Added performance optimizations in aufs and layer store that helps in massively parallel container creation and removal. moby/moby#39107, moby/moby#39135 - daemon: fixed a mirrors validation issue. moby/moby#38991 - Docker no longer supports sorting UID and GID ranges in ID maps. moby/moby#39288 Logging - Added a fix that now allows large log lines for logger plugins. moby/moby#39038 Signed-off-by: Peter Korsgaard <[email protected]> Signed-off-by: Arnout Vandecappelle (Essensium/Mind) <[email protected]> (cherry picked from commit 13cf6f0) Signed-off-by: Peter Korsgaard <[email protected]>
Fixes CVE-2018-15664: API endpoints behind the 'docker cp' command are vulnerable to a symlink-exchange attack with Directory Traversal, giving attackers arbitrary read-write access to the host filesystem with root privileges, because daemon/archive.go does not do archive operations on a frozen filesystem (or from within a chroot). And includes additional post-18.09.6 fixes: Builder - Fixed a panic error when building dockerfiles that contain only comments. moby/moby#38487 - Added a workaround for GCR authentication issue. moby/moby#38246 - Builder-next: Fixed a bug in the GCR token cache implementation workaround. moby/moby#39183 Runtime - Added performance optimizations in aufs and layer store that helps in massively parallel container creation and removal. moby/moby#39107, moby/moby#39135 - daemon: fixed a mirrors validation issue. moby/moby#38991 - Docker no longer supports sorting UID and GID ranges in ID maps. moby/moby#39288 Logging - Added a fix that now allows large log lines for logger plugins. moby/moby#39038 Signed-off-by: Peter Korsgaard <[email protected]> Signed-off-by: Arnout Vandecappelle (Essensium/Mind) <[email protected]> (cherry picked from commit cdbb3ce) Signed-off-by: Peter Korsgaard <[email protected]>
Fixes CVE-2018-15664: API endpoints behind the 'docker cp' command are vulnerable to a symlink-exchange attack with Directory Traversal, giving attackers arbitrary read-write access to the host filesystem with root privileges, because daemon/archive.go does not do archive operations on a frozen filesystem (or from within a chroot). And includes additional post-18.09.6 fixes: Builder - Fixed a panic error when building dockerfiles that contain only comments. moby/moby#38487 - Added a workaround for GCR authentication issue. moby/moby#38246 - Builder-next: Fixed a bug in the GCR token cache implementation workaround. moby/moby#39183 Runtime - Added performance optimizations in aufs and layer store that helps in massively parallel container creation and removal. moby/moby#39107, moby/moby#39135 - daemon: fixed a mirrors validation issue. moby/moby#38991 - Docker no longer supports sorting UID and GID ranges in ID maps. moby/moby#39288 Logging - Added a fix that now allows large log lines for logger plugins. moby/moby#39038 Signed-off-by: Peter Korsgaard <[email protected]> Signed-off-by: Arnout Vandecappelle (Essensium/Mind) <[email protected]> (cherry picked from commit 13cf6f0) Signed-off-by: Peter Korsgaard <[email protected]>
Fixes CVE-2018-15664: API endpoints behind the 'docker cp' command are vulnerable to a symlink-exchange attack with Directory Traversal, giving attackers arbitrary read-write access to the host filesystem with root privileges, because daemon/archive.go does not do archive operations on a frozen filesystem (or from within a chroot). And includes additional post-18.09.6 fixes: Builder - Fixed a panic error when building dockerfiles that contain only comments. moby/moby#38487 - Added a workaround for GCR authentication issue. moby/moby#38246 - Builder-next: Fixed a bug in the GCR token cache implementation workaround. moby/moby#39183 Runtime - Added performance optimizations in aufs and layer store that helps in massively parallel container creation and removal. moby/moby#39107, moby/moby#39135 - daemon: fixed a mirrors validation issue. moby/moby#38991 - Docker no longer supports sorting UID and GID ranges in ID maps. moby/moby#39288 Logging - Added a fix that now allows large log lines for logger plugins. moby/moby#39038 Signed-off-by: Peter Korsgaard <[email protected]> Signed-off-by: Arnout Vandecappelle (Essensium/Mind) <[email protected]> (cherry picked from commit cdbb3ce) Signed-off-by: Peter Korsgaard <[email protected]>
For detailed description, see commit messages.
Performance and stress testing
This was tested by the following script on a machine with 8GB RAM (with more RAM you can increase the value of
PARALLEL):Results for the above tests, measured at least 3 times (with #39135 applied, with and without this PR):
No errors were detected.