Skip to content
This repository was archived by the owner on Feb 24, 2020. It is now read-only.

stage1/prepare-app: avoid recursive bind-mounts on /sys#2386

Merged
alban merged 1 commit intorkt:masterfrom
kinvolk:alban/pod-100-apps
Apr 20, 2016
Merged

stage1/prepare-app: avoid recursive bind-mounts on /sys#2386
alban merged 1 commit intorkt:masterfrom
kinvolk:alban/pod-100-apps

Conversation

@alban
Copy link
Member

@alban alban commented Apr 6, 2016

Starting a pod with 100 apps is really slow. This patch fixes the
slowness.

prepare-app recursively bind-mounts /sys from stage1 to the stage2
(apps' rootfs). "Recursive" means it includes all the cgroup mounts
because they are in /sys/fs/cgroup. Moreover, rkt bind mounts some
cgroup knob files in the cgroup filesystem for enabling the memory and
cpu isolator.

The number of cgroup bind mounts in stage1 is linear with the number of
apps: O(n)

The number of cgroup bind mounts in stage2 is quadratic with the number
of apps: O(n^2)

With one app, I have 17 bind mounts related to cgroups. With 100 apps,
17 * 100 * 100 = 170.000 bind mounts.

For each change in the mount table, systemd is notified via inotify on
/proc/self/mountinfo and it checks the configuration of that mount in
/etc/systemd/system, /run/systemd/system, /usr/local/lib/systemd/system
and /usr/lib64/systemd/system.

systemd does about 30 syscalls per new mount notified via
/proc/self/mountinfo. That was 5.100.000 syscalls for mounting
cgroups in a 100-app pod.

This patch changes the logic to bind mount /sys:

  • in the usual non-unified cgroup hierarchy: bind mount /sys
    non-recursively and instead bind-mount each cgroup hierarchy
    individually, non-recursively.
  • in the unified cgroup hierarchy: keep the recursive bind mount because
    we will not have per-app bind mounts on knob files in this case.

Fixes #2351


/cc @s-urbaniak

@iaguis
Copy link
Member

iaguis commented Apr 6, 2016

One could have the concern that apps relying on filesystems inside /sys would be broken.

However, the only submount that rkt has available in stage1 is /sys/fs/cgroup. I don't think apps need access to that (unless you want to run cadvisor in a rkt container?).

@iaguis
Copy link
Member

iaguis commented Apr 11, 2016

Or maybe we should consider mounting the controllers RO in `/sys/fs/cgroup...

@yifan-gu
Copy link
Contributor

yifan-gu commented Apr 15, 2016

Ideally, the app should only see its own cgroups? But seems to take more work to fix at this moment..

@yifan-gu
Copy link
Contributor

yifan-gu commented Apr 15, 2016

However, the only submount that rkt has available in stage1 is /sys/fs/cgroup. I don't think apps need access to that (unless you want to run cadvisor in a rkt container?).

So if the cadvisor pod mounts the host's /sys/fs/cgroup itself, will that work after this PR?

@alban
Copy link
Member Author

alban commented Apr 18, 2016

@yifan-gu isn't cadvisor started by rkt fly? Prepare-app is not used by rkt fly, so this PR should not change anything. If not using rkt fly, I am not sure if mounting /sys/fs/cgroup as a volume would work because that would be mounted by nspawn before prepare-app mounts /sys, hiding the previous subdirectories...

@yifan-gu
Copy link
Contributor

@alban For now, cadvisor is not running standalone as a container, in fact it is integrated with kubelet as a library. Kubelet is running as a process or by rkt fly.

@alban alban self-assigned this Apr 18, 2016
@alban alban force-pushed the alban/pod-100-apps branch from 0e48f74 to ba78630 Compare April 20, 2016 09:43
@alban
Copy link
Member Author

alban commented Apr 20, 2016

@s-urbaniak @iaguis Patch updated. PTAL.

{ "/dev/shm", "/dev/shm", "bind", NULL, MS_BIND },
{ "/dev/pts", "/dev/pts", "bind", NULL, MS_BIND },
{ "/run/systemd/journal", "/run/systemd/journal", "bind", NULL, MS_BIND },
/* /sys is handled separatly */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

separately

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


pexit_if(statfs("/sys/fs/cgroup", &fs) != 0,
"Cannot statfs /sys/fs/cgroup");
if (fs.f_type == (typeof(fs.f_type)) CGROUP2_SUPER_MAGIC) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe only add the MS_REC flag here instead of repeating the code below.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, I see that you return right away, then it's alright I guess

@iaguis
Copy link
Member

iaguis commented Apr 20, 2016

One comment but LGTM after that

@alban alban force-pushed the alban/pod-100-apps branch from 0eb4064 to 225aa46 Compare April 20, 2016 14:16
@alban
Copy link
Member Author

alban commented Apr 20, 2016

@iaguis Updated. Tested with too-many-apps.podmanifest from #2324. PTAL.

@alban alban force-pushed the alban/pod-100-apps branch from 225aa46 to 4194839 Compare April 20, 2016 14:20
Starting a pod with 100 apps is really slow. This patch fixes the
slowness.

prepare-app recursively bind-mounts /sys from stage1 to the stage2
(apps' rootfs). "Recursive" means it includes all the cgroup mounts
because they are in /sys/fs/cgroup. Moreover, rkt bind mounts some
cgroup knob files in the cgroup filesystem for enabling the memory and
cpu isolator.

The number of cgroup bind mounts in stage1 is linear with the number of
apps: O(n)

The number of cgroup bind mounts in stage2 is quadratic with the number
of apps: O(n^2)

With one app, I have 17 bind mounts related to cgroups. With 100 apps,
17 * 100 * 100 = 170.000 bind mounts.

For each change in the mount table, systemd is notified via inotify on
/proc/self/mountinfo and it checks the configuration of that mount in
/etc/systemd/system, /run/systemd/system, /usr/local/lib/systemd/system
and /usr/lib64/systemd/system.

systemd does about 30 syscalls per new mount notified via
/proc/self/mountinfo. That was 5.100.000 syscalls for mounting
cgroups in a 100-app pod.

This patch changes the logic to bind mount /sys:
- in the usual non-unified cgroup hierarchy: bind mount /sys
  non-recursively and instead bind-mount each cgroup hierarchy
  individually, non-recursively.
- in the unified cgroup hierarchy: keep the recursive bind mount because
  we will not have per-app bind mounts on knob files in this case.
@iaguis
Copy link
Member

iaguis commented Apr 20, 2016

LGTM

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

prepare-app: perf issue with mounting /sys (benchmarks)

3 participants