What is the problem you're trying to solve
Linux has for several years offered a new API for defining/constructing mounts. This API is probably preferable for programmatic mount creation for simple reasons like moving away from constructing and interpreting lengthy strings. More importantly, it enables types beyond strings to be a part of the conversation between userspace and the kernel when describing a new mount.
An extension to this new mount API of particular interest to containerd is being readied. As of Linux 6.13 it is nearly possible to provide the layers of an overlayfs as file descriptors instead of as path strings; support was attempted, but some minor fixes/improvements have been identified that seem likely to land by 6.15, see link. Using this new API would benefit containerd: when user namespaces are enabled, we must create a new mount for each layer of the overlayfses we assemble in order to apply user idmapping. Under the old mount API, each of these mounts (usually several per container) must be attached somewhere in the host’s directory tree so that overlayfs can be pointed at it through a string mount option. We have found these attached temporary mounts problematic to reliably clean up (see #10704 and its several linked PRs; the end state after #10955 will leak mounts when our umount fails with EBUSY). Furthermore, reducing the visibility of these temporary mounts from “visible to anyone reading the mounts list” to “an fd only containerd and the kernel know about” should provide better isolation between containerd and the kernel.
For these reasons I believe now is a good time to refactor containerd so that it can use the new Linux mount API. The refactor probably involves enough changes that it makes sense to begin work in parallel with the final kernel issues being ironed out.
Describe the solution you'd like
Here’s my plan – as a relatively new contributor I would appreciate suggestions, improvements, or simple awareness from maintainers and others before dropping an actual PR. Maybe we should discuss totally different approaches, such as leaving the main Mount type and its code paths alone and creating a more divergent code path specifically for the idmapped overlayfs case.
Proposal: Support defining a given mount’s options as either a set of traditional strings or as a set of typed parameters
This would look like adjusting the Mount struct’s Options []string to an Options []FsParam, where each FsParam is an interface that may resolve to a traditional opaque string, or may resolve to a key and its typed value. Supported OSes that do not support typed mount parameters would simply be refactored to provide FsParams of traditional strings; similarly any code paths in Linux where using the new mount API isn’t of particular importance would be minimally refactored in the same way.
When the FsParams are of the key/typed-value variety, the mount would ultimately be performed using the new mount API, with keys being passed to fsconfig, the Go type of the value determining the “command” to that call, and the value being passed to fsconfig. It would be invalid for a given Mount’s Options to include a mix of traditional-string and key/typed-value options; i.e. either the old or the new mount API would be used and we would not try to mix the two.
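To make the shape of this concrete, here is a minimal Go sketch of what I have in mind. Every name in it (FsParam, LegacyOption, TypedParam, FD, FromStrings, UsesNewAPI, FsconfigCommand, ErrMixedOptions) is hypothetical and not existing containerd API. The command constants are local copies of the kernel's fsconfig_command values, and the overlayfs keys are only illustrative. It performs no syscalls; it only shows wrapping legacy strings, rejecting mixed options, and picking the fsconfig command from the Go type of the value.

```go
package main

import (
	"errors"
	"fmt"
)

// Local copies of values from the kernel's enum fsconfig_command.
const (
	fsconfigSetFlag   = 0 // FSCONFIG_SET_FLAG: key only, no value
	fsconfigSetString = 1 // FSCONFIG_SET_STRING: key plus string value
	fsconfigSetFD     = 5 // FSCONFIG_SET_FD: key plus file descriptor
)

// FsParam (hypothetical) is one mount option, either legacy or typed.
type FsParam interface{ isFsParam() }

// LegacyOption is an opaque option string for the old mount API.
type LegacyOption string

// FD marks a value as a file descriptor rather than a plain integer.
type FD int

// TypedParam is a key with a typed value, destined for fsconfig.
// In this sketch Value may be nil (a flag), a string, or an FD.
type TypedParam struct {
	Key   string
	Value any
}

func (LegacyOption) isFsParam() {}
func (TypedParam) isFsParam()   {}

// ErrMixedOptions is returned when legacy and typed options are combined.
var ErrMixedOptions = errors.New("mount options mix legacy strings and typed parameters")

// FromStrings wraps existing []string options so callers that do not care
// about the new API need only a minimal change.
func FromStrings(opts []string) []FsParam {
	out := make([]FsParam, 0, len(opts))
	for _, o := range opts {
		out = append(out, LegacyOption(o))
	}
	return out
}

// UsesNewAPI reports whether opts call for the new mount API.
// It returns ErrMixedOptions if both kinds of option are present.
func UsesNewAPI(opts []FsParam) (bool, error) {
	var legacy, typed int
	for _, o := range opts {
		switch o.(type) {
		case LegacyOption:
			legacy++
		case TypedParam:
			typed++
		default:
			return false, fmt.Errorf("unknown FsParam type %T", o)
		}
	}
	if legacy > 0 && typed > 0 {
		return false, ErrMixedOptions
	}
	return typed > 0, nil
}

// FsconfigCommand selects the fsconfig command from the Go type of the value.
func FsconfigCommand(p TypedParam) (int, error) {
	switch p.Value.(type) {
	case nil:
		return fsconfigSetFlag, nil
	case string:
		return fsconfigSetString, nil
	case FD:
		return fsconfigSetFD, nil
	default:
		return 0, fmt.Errorf("unsupported value type %T for key %q", p.Value, p.Key)
	}
}

func main() {
	typed := []FsParam{
		TypedParam{Key: "lowerdir+", Value: FD(7)}, // illustrative key and fd number
		TypedParam{Key: "userxattr"},               // nil value, treated as a flag
	}
	if newAPI, err := UsesNewAPI(typed); err == nil && newAPI {
		for _, o := range typed {
			p := o.(TypedParam)
			cmd, _ := FsconfigCommand(p)
			fmt.Printf("fsconfig key=%q cmd=%d\n", p.Key, cmd)
		}
	}
	// Mixing the two kinds is rejected up front.
	mixed := append(FromStrings([]string{"ro"}), typed...)
	_, err := UsesNewAPI(mixed)
	fmt.Println("mixed options:", err)
}
```

A real implementation would presumably pass the selected command, key, and value on to fsconfig; whether a sealed interface like this or some other representation is the better fit is exactly the kind of feedback I'm looking for.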
With some change like this in place it would then be possible to directly tackle utilizing the new mount API for idmapped overlayfs layers as a follow-on.
Additional context
cc: @rata @fuweid