Issue found via #2324: starting a pod with 100 apps is really slow. The problem goes away when testing with this patch:
--- a/stage1/prepare-app/prepare-app.c
+++ b/stage1/prepare-app/prepare-app.c
@@ -152,7 +152,7 @@ int main(int argc, char *argv[])
     };
     static const mount_point dirs_mount_table[] = {
         { "/proc", "/proc", "bind", NULL, MS_BIND|MS_REC },
-        { "/sys", "/sys", "bind", NULL, MS_BIND|MS_REC },
+        { "/sys", "/sys", "bind", NULL, MS_BIND },
         { "/dev/shm", "/dev/shm", "bind", NULL, MS_BIND },
         { "/dev/pts", "/dev/pts", "bind", NULL, MS_BIND },
         { "/run/systemd/journal", "/run/systemd/journal", "bind", NULL, MS_BIND },
prepare-app recursively bind-mounts /sys from stage1 into stage2 (each app's rootfs). "Recursive" means the bind mount also carries every mount below /sys, including all the cgroup mounts under /sys/fs/cgroup. On top of that, rkt bind-mounts some cgroup knob files inside the cgroup filesystem to enable the memory and CPU isolators.
The number of cgroup bind mounts in stage1 is linear in the number of apps: O(n)
The number of cgroup bind mounts in stage2 is quadratic in the number of apps: O(n^2), because each of the n apps gets its own recursive copy of the O(n) cgroup mounts from stage1.
With one app, I have 17 bind mounts related to cgroups. With 100 apps, that is 17 * 100 * 100 = 170,000 bind mounts.
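One way to check such counts on a running system is to count the cgroup entries in /proc/self/mountinfo. The sketch below is an assumption about how one might do that, not the method used to obtain the numbers above. The function names are hypothetical. It relies on the mountinfo format, where the filesystem type follows the " - " separator, and it only matches the cgroup v1 type string "cgroup".

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if a mountinfo line describes a mount of type "cgroup", else 0.
 * Spaces inside paths are escaped as \040 in mountinfo, so " - " can only
 * be the separator that precedes the filesystem type. */
int is_cgroup_entry(const char *line)
{
	const char *sep = strstr(line, " - ");

	return sep != NULL && strncmp(sep + 3, "cgroup ", 7) == 0;
}

/* Counts cgroup entries in a mountinfo-formatted stream.
 * Assumes each line fits in the buffer; longer lines would be split. */
long count_cgroup_mounts(FILE *f)
{
	char line[4096];
	long count = 0;

	while (fgets(line, sizeof(line), f) != NULL)
		count += is_cgroup_entry(line);
	return count;
}
```

Calling `count_cgroup_mounts` on a stream opened from /proc/self/mountinfo inside stage1 and inside an app's rootfs would show how the count grows with the number of apps.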
For each change in the mount table, systemd is notified through /proc/self/mountinfo and checks for a configuration for that mount in /etc/systemd/system, /run/systemd/system, /usr/local/lib/systemd/system and /usr/lib64/systemd/system.
systemd makes about 30 syscalls per new mount reported in /proc/self/mountinfo. That would be 30 * 170,000 = 5,100,000 syscalls for mounting cgroups in a 100-app pod.
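The estimates above reduce to a few multiplications. The sketch below spells them out with the figures from this issue (17 cgroup bind mounts per app, about 30 syscalls per mount). The function names are hypothetical, and the model is deliberately rough: it assumes every app contributes the same number of mounts.

```c
/* cgroup bind mounts in stage1: linear in the number of apps. */
unsigned long long stage1_cgroup_mounts(unsigned long long per_app,
					unsigned long long apps)
{
	return per_app * apps;
}

/* cgroup bind mounts in stage2: each app's rootfs receives a recursive
 * copy of all stage1 cgroup mounts, hence quadratic in the number of apps. */
unsigned long long stage2_cgroup_mounts(unsigned long long per_app,
					unsigned long long apps)
{
	return stage1_cgroup_mounts(per_app, apps) * apps;
}

/* Syscalls made by systemd, given a fixed cost per newly reported mount. */
unsigned long long systemd_syscalls(unsigned long long mounts,
				    unsigned long long syscalls_per_mount)
{
	return mounts * syscalls_per_mount;
}
```

For 100 apps, `stage2_cgroup_mounts(17, 100)` gives 170,000 and `systemd_syscalls(170000, 30)` gives 5,100,000, matching the numbers in this report.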