starting many user namespace enabled pods at once causes bad mount performance #12048

@halaney

Description

When starting ~100 pods at roughly the same time using an image with ~50 layers (large, I know), containerd often becomes bogged down enough that CRI's CreateContainer can time out. systemd also gets busy and fails to respond on D-Bus in a timely manner (> 30 seconds). This results in failing pods, poor node health, etc.

Most of the time appears to be spent in prepareIDMappedOverlay, which performs an idmapped bind mount for each layer to shift ownership to the appropriate user for the container. Because this is done in the host's mount namespace, systemd sees all of these mounts and spends a great deal of time processing them. There are a few systemd issues around this, with discussions about using the newer fanotify/listmount syscalls to improve matters. The general consensus, though, is that dramatically increasing the global mount table size should be avoided:
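A back-of-the-envelope sketch of why the mount table blows up, assuming one idmapped bind mount per layer per pod (which matches the per-layer behavior described above; the exact count containerd creates may differ):

```python
# Rough estimate of extra entries added to the host's global mount table
# when every overlayfs layer of every pod gets its own idmapped bind mount.
# The numbers below are the illustrative figures from this report.

pods = 100    # pods started around the same time
layers = 50   # overlayfs lower layers in the image

extra_mounts = pods * layers
print(extra_mounts)  # → 5000
```

Thousands of near-simultaneous additions to a single global mount table is exactly the kind of churn the linked systemd issues describe as pathological for mount-table listeners.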
systemd/systemd#33186 (comment)
systemd/systemd#31137

Outside of systemd, this load is enough to bog down containerd itself as it contends for kernel resources, resulting in the CRI timeouts mentioned above.

Steps to reproduce the issue

  1. Launch many containers using an image with many layers
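For reference, a minimal pod spec sketch for enabling user namespaces (the pod and image names are placeholders; `hostUsers: false` is the Kubernetes field that opts a pod into its own user namespace):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: userns-repro   # hypothetical name
spec:
  hostUsers: false     # run the pod in a user namespace
  containers:
  - name: app
    image: registry.example.com/many-layer-image:latest  # placeholder: any image with ~50 layers
```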

Describe the results you received and expected

I expected these workloads to run fine, since they worked before user namespaces were enabled!

What version of containerd are you using?

v2.0.5

Any other relevant information

No response

Show configuration if it is related to CRI plugin.

No response
