Description
When launching ~100 pods at around the same time using an image that has ~50 layers (large, I know), containerd often becomes bogged down enough that CRI CreateContainer calls time out. systemd also gets busy and fails to respond on D-Bus in a timely manner (> 30 seconds). This results in failing pods, bad node health, etc.
It seems most of the time is spent in prepareIDMappedOverlay, which performs the idmapped bind mounts that shift ownership to the appropriate user for each layer in the container. Because this happens in the host's mount namespace, systemd sees all of these mounts and spends a great deal of time processing them. There are a few issues around this in systemd, with discussions about using the newer fanotify/listmount syscalls to improve matters. The general consensus, though, is that dramatically increasing the global mount table size should be avoided:
systemd/systemd#33186 (comment)
systemd/systemd#31137
Outside of systemd this load is enough to really bog down containerd while fighting over various kernel resources, resulting in the CRI timeouts I alluded to.
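To get a sense of how much the global mount table grows during such a burst, mount entries can be counted per filesystem type from /proc/self/mountinfo. This is only a diagnostic sketch, not part of containerd; it relies on the filesystem type being the field immediately after the " - " separator in each mountinfo line:

```shell
# Diagnostic sketch: count mount entries visible in this mount namespace.
# In /proc/self/mountinfo the filesystem type is the field right after
# the " - " separator, so " - overlay " matches overlayfs mounts.
total=$(wc -l < /proc/self/mountinfo)
overlay=$(grep -c ' - overlay ' /proc/self/mountinfo || true)
echo "total mounts: $total, overlay mounts: $overlay"
```

Back of the envelope: with ~100 pods and ~50 layers each, and each layer getting its own idmapped bind mount, that is on the order of 5,000 extra mount entries appearing at roughly the same time.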
Steps to reproduce the issue
- Launch many containers using an image with many layers
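A minimal repro sketch as a manifest. The image, names, and replica count here are all illustrative, and it assumes a cluster where user namespaces are enabled so hostUsers: false takes effect:

```yaml
# Hypothetical manifest: image, names, and replica count are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: layer-burst
spec:
  replicas: 100            # launch ~100 pods at once
  selector:
    matchLabels: {app: layer-burst}
  template:
    metadata:
      labels: {app: layer-burst}
    spec:
      hostUsers: false     # user namespaces -> idmapped layer mounts
      containers:
      - name: app
        image: registry.example.com/many-layers:latest  # image with ~50 layers
        command: ["sleep", "infinity"]
```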
Describe the results you received and expected
I was expecting these workloads to run fine, as they worked without user namespaces enabled prior!
What version of containerd are you using?
v2.0.5
Any other relevant information
No response
Show configuration if it is related to CRI plugin.
No response