stage1/prepare-app: avoid recursive bind-mounts on /sys#2386
stage1/prepare-app: avoid recursive bind-mounts on /sys#2386alban merged 1 commit intorkt:masterfrom
Conversation
|
One could have the concern that apps relying on filesystems inside However, the only submount that rkt has available in stage1 is |
|
Or maybe we should consider mounting the controllers RO in `/sys/fs/cgroup... |
|
Ideally, the app should only see its own cgroups? But seems to take more work to fix at this moment.. |
So if the cadvisor pod mounts the host's /sys/fs/cgroup itself, will that work after this PR? |
|
@yifan-gu isn't cadvisor started by rkt fly? Prepare-app is not used by rkt fly, so this PR should not change anything. If not using rkt fly, I am not sure if mounting /sys/fs/cgroup as a volume would work because that would be mounted by nspawn before prepare-app mounts /sys, hiding the previous subdirectories... |
|
@alban For now, cadvisor is not running standalone as a container, in fact it is integrated with kubelet as a library. Kubelet is running as a process or by rkt fly. |
0e48f74 to
ba78630
Compare
|
@s-urbaniak @iaguis Patch updated. PTAL. |
stage1/prepare-app/prepare-app.c
Outdated
| { "/dev/shm", "/dev/shm", "bind", NULL, MS_BIND }, | ||
| { "/dev/pts", "/dev/pts", "bind", NULL, MS_BIND }, | ||
| { "/run/systemd/journal", "/run/systemd/journal", "bind", NULL, MS_BIND }, | ||
| /* /sys is handled separatly */ |
ba78630 to
0eb4064
Compare
|
|
||
| pexit_if(statfs("/sys/fs/cgroup", &fs) != 0, | ||
| "Cannot statfs /sys/fs/cgroup"); | ||
| if (fs.f_type == (typeof(fs.f_type)) CGROUP2_SUPER_MAGIC) { |
There was a problem hiding this comment.
maybe only add the MS_REC flag here instead of repeating the code below.
There was a problem hiding this comment.
oh, I see that you return right away, then it's alright I guess
|
One comment but LGTM after that |
0eb4064 to
225aa46
Compare
225aa46 to
4194839
Compare
Starting a pod with 100 apps is really slow. This patch fixes the slowness. prepare-app recursively bind-mounts /sys from stage1 to the stage2 (apps' rootfs). "Recursive" means it includes all the cgroup mounts because they are in /sys/fs/cgroup. Moreover, rkt bind mounts some cgroup knob files in the cgroup filesystem for enabling the memory and cpu isolator. The number of cgroup bind mounts in stage1 is linear with the number of apps: O(n) The number of cgroup bind mounts in stage2 is quadratic with the number of apps: O(n^2) With one app, I have 17 bind mounts related to cgroups. With 100 apps, 17 * 100 * 100 = 170.000 bind mounts. For each change in the mount table, systemd is notified via inotify on /proc/self/mountinfo and it checks the configuration of that mount in /etc/systemd/system, /run/systemd/system, /usr/local/lib/systemd/system and /usr/lib64/systemd/system. systemd does about 30 syscalls per new mount notified via /proc/self/mountinfo. That was 5.100.000 syscalls for mounting cgroups in a 100-app pod. This patch changes the logic to bind mount /sys: - in the usual non-unified cgroup hierarchy: bind mount /sys non-recursively and instead bind-mount each cgroup hierarchy individually, non-recursively. - in the unified cgroup hierarchy: keep the recursive bind mount because we will not have per-app bind mounts on knob files in this case.
|
LGTM |
Starting a pod with 100 apps is really slow. This patch fixes the
slowness.
prepare-app recursively bind-mounts /sys from stage1 to the stage2
(apps' rootfs). "Recursive" means it includes all the cgroup mounts
because they are in /sys/fs/cgroup. Moreover, rkt bind mounts some
cgroup knob files in the cgroup filesystem for enabling the memory and
cpu isolator.
The number of cgroup bind mounts in stage1 is linear with the number of
apps: O(n)
The number of cgroup bind mounts in stage2 is quadratic with the number
of apps: O(n^2)
With one app, I have 17 bind mounts related to cgroups. With 100 apps,
17 * 100 * 100 = 170.000 bind mounts.
For each change in the mount table, systemd is notified via inotify on
/proc/self/mountinfo and it checks the configuration of that mount in
/etc/systemd/system, /run/systemd/system, /usr/local/lib/systemd/system
and /usr/lib64/systemd/system.
systemd does about 30 syscalls per new mount notified via
/proc/self/mountinfo. That was 5.100.000 syscalls for mounting
cgroups in a 100-app pod.
This patch changes the logic to bind mount /sys:
non-recursively and instead bind-mount each cgroup hierarchy
individually, non-recursively.
we will not have per-app bind mounts on knob files in this case.
Fixes #2351
/cc @s-urbaniak