[bug & proposal] overlayfs support more-layers-image #2497

fuweid · 2018-07-25T07:52:42Z

Bug

Description

The number of layers is limited by the maximum size of the mount option buffer in the kernel (one page, generally 4096 bytes). For now, containerd uses the absolute path of each snapshot in the lowerdir option. The root path is typically /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots (68 bytes), so we cannot pull an image that has about 60 layers (~ 4096/68).
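The estimate above can be reproduced with a small Go sketch. It assumes each lowerdir entry has the form <root>/<id>/fs, as in the mount output shown later in this issue; the helper name lowerdirOption and the IDs starting at 1 are illustrative, not containerd code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// snapshotRoot is the default overlayfs snapshot directory quoted in this issue.
const snapshotRoot = "/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots"

// lowerdirOption builds a lowerdir option from absolute snapshot paths.
// The <root>/<id>/fs layout and the sequential IDs are assumptions.
func lowerdirOption(root string, layers int) string {
	paths := make([]string, 0, layers)
	for id := 1; id <= layers; id++ {
		paths = append(paths, root+"/"+strconv.Itoa(id)+"/fs")
	}
	return "lowerdir=" + strings.Join(paths, ":")
}

func main() {
	const pageSize = 4096 // typical page size; the real value depends on the system
	fmt.Println("root length:", len(snapshotRoot))
	for _, n := range []int{50, 60} {
		opt := lowerdirOption(snapshotRoot, n)
		fmt.Printf("%d layers: %d bytes, fits in one page: %v\n", n, len(opt), len(opt) < pageSize)
	}
}
```

Under these assumptions the option for 60 layers is already larger than one 4096-byte page, before upperdir and workdir are added.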

Reproduce

FROM busybox:latest

# repeat the following line 60 times
RUN mktemp

Run ctr pull ${imageName} and you will get:

INFO[0001] apply failure, attempting cleanup             error="failed to extract layer sha256:0f87b58539d6201873662f4d4c1c7f4097367770530b6451df54f3539231b26e: failed to mount /var/lib/containerd/tmpmounts/containerd-mount550654010: no such file or directory: unknown" key="extract-982113456-BiJs sha256:a4de2c26d7e33d26f51b87a4637f312edc3da2fc2f566f9485c774d8549e4d71"
ctr: failed to extract layer sha256:0f87b58539d6201873662f4d4c1c7f4097367770530b6451df54f3539231b26e: failed to mount /var/lib/containerd/tmpmounts/containerd-mount550654010: no such file or directory: unknown

containerd version

➜  containerd git:(master) containerd --version
containerd github.com/containerd/containerd v1.1.0-196-g0d52c71 0d52c71c805ec8b4e1d683e919d6914845067794

Proposal

I have two proposals for this issue:

symlink the `snapshots` dir

Create a short link to the snapshots dir, like /tmp/ctrdl, to compact the lowerdir option. However, a link under /tmp is out of our control; for example, it can be cleaned up by tmpwatch. It's also hard to maintain the link across start/stop/start cycles.

use `reexec` to change work dir before mount

Like moby, use reexec to fork a process that performs the mount. Since the snapshotter service only provides the mount options and does not perform the mount itself, I want to change the behaviour of github.com/containerd/containerd/mount, like

// https://github.com/containerd/containerd/blob/master/mount/mount.go

// Mount is the lingua franca of containerd. A mount represents a
// serialized mount syscall. Components either emit or consume mounts.
 type Mount struct {
+       // Chdir allows to change working directory to avoid to exceed max of
+       // the mount option buffer (page size) in kernel.
+       Chdir string
        // Type specifies the host-specific of the mount.
        Type string
        // Source specifies where to mount from. Depending on the host system, this

func (m *Mount) Mount will use reexec if Chdir is not empty. For overlayfs, containerd will change the working directory to snapshots and mount the layers, like

➜  containerd git:(proposal) ctr run --rm -t localhost:5000/t:latest sh

overlay on /run/containerd/io.containerd.runtime.v1.linux/default/sh/rootfs type overlay (rw,relatime,lowerdir=209/fs:208/fs:207/fs:206/fs:205/fs:204/fs:203/fs:202/fs:201/fs:200/fs:199/fs:198/fs:197/fs:196/fs:195/fs:194/fs:193/fs:192/fs:191/fs:190/fs:189/fs:188/fs:187/fs:186/fs:185/fs:184/fs:183/fs:182/fs:181/fs:180/fs:179/fs:178/fs:177/fs:176/fs:175/fs:174/fs:173/fs:172/fs:171/fs:170/fs:169/fs:168/fs:167/fs:166/fs:165/fs:164/fs:163/fs:162/fs:161/fs:160/fs:159/fs:158/fs:157/fs:156/fs:155/fs:154/fs:153/fs,upperdir=210/fs,workdir=210/work)

A snapshot ID takes at most 20 digits, so this approach can support more than 128 layers for overlayfs.
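A rough worst-case check of that claim, as a sketch: it assumes snapshot IDs are no longer than the largest unsigned 64-bit integer in decimal (20 digits) and that each relative lowerdir entry looks like <id>/fs followed by a ":" separator. The function name maxRelativeLayers is hypothetical. It ignores the upperdir and workdir options, so it gives an upper bound rather than an exact limit.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// maxRelativeLayers estimates how many worst-case "<id>/fs:" entries fit in
// pageSize bytes. It ignores the other mount options (upperdir, workdir).
func maxRelativeLayers(pageSize int) int {
	// Assumption: an ID is at most as long as the largest uint64 in decimal.
	idDigits := len(strconv.FormatUint(math.MaxUint64, 10))
	entryLen := idDigits + len("/fs") + len(":")
	return pageSize / entryLen
}

func main() {
	fmt.Println(maxRelativeLayers(4096))
}
```

With a 4096-byte page and 24 bytes per entry this gives 170 entries, which is consistent with "more than 128".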

Since the snapshot service, task service, and runtime/shim service all consume the mount option, this proposal will change the proto file, too.

It seems that reexec can handle more overlayfs layers well. However, it needs an API change and introduces reexec behaviour like moby's. Does it make sense?

ping @dmcgowan and @AkihiroSuda


AkihiroSuda · 2018-07-25T08:18:22Z

Ideally we should reexec+chdir but I fear it breaks API compatibility.

Can we keep the current API and add an overlayfs-specific chdir hack to https://github.com/containerd/containerd/blob/master/mount/mount_linux.go ?

fuweid · 2018-07-25T08:34:14Z

Can we keep the current API and add an overlayfs-specific chdir hack to https://github.com/containerd/containerd/blob/master/mount/mount_linux.go ?

Agree with the API compatibility point.

I think we can use Type:overlay and Option:chdir opt to hack the chdir, like:

// https://github.com/containerd/containerd/blob/master/snapshots/overlay/overlay.go#L492
options = append(options, fmt.Sprintf("lowerdir=%s", strings.Join(parentPaths, ":")))
+ options = append(options, fmt.Sprintf("chdir=%s", filepath.Join(o.root, "snapshots")))

// https://github.com/containerd/containerd/blob/master/mount/mount_linux.go
// check whether the options contain chdir

Doable but hacky. 😄

AkihiroSuda · 2018-07-25T08:46:14Z

I think we can use Type:overlay and Option:chdir opt to hack the chdir, like:

That would make a containerd v1.1 client incompatible with a containerd v1.2 daemon.
Can we let mount_linux.go auto-detect chdir path from options values?

ijc · 2018-07-25T09:19:07Z

I guess it'll be a while before it is generally available enough for us to rely on, but it seems there is work afoot in kernel-land which would remove the 4k limit to mount options: Six (or seven) new system calls for filesystem mounting.

fuweid · 2018-07-25T09:42:21Z

That would make a containerd v1.1 client incompatible with a containerd v1.2 daemon.

The client just sends the mount info into diff.apply service and the mount info comes from snapshot service. I think the lower-version client can access the higher-version daemon. Is that correct?

AkihiroSuda · 2018-07-25T09:57:37Z

That's true if the client just lets the daemon call mount(2), but IIUC the client should also be allowed to call mount(2) by itself.

dmcgowan · 2018-07-25T18:38:30Z

Thanks for bringing this up, this was next on my 1.2 TODO of issues to look at

My plan was to implement this functionality entirely in the mount package, avoiding any API-facing changes or changes to the snapshotters. Any snapshotter, not just the included overlay snapshotter, should be able to return standard overlay mounts without having to add customized mount options. This puts the burden on the mount logic to figure out how to perform that mount based on the current system's limitations. I can think of 3 ways to do it, 2 of which you already mentioned.

Symlink - (1) Find a common prefix amongst the lowerdir values (2) Create a symlink in the /tmp directory to that location (3) Replace the common prefix in lowerdir values. (4) Perform mount
2 layer mount - (1) Remove a number of paths from the end of lowerdir (2) Perform a read-only overlay mount in temp (3) Replace the included lowerdir paths with the single mount (4) Repeat the process, excluding any directories which are already overlay since only 2 levels of overlay are allowed, until the mount can be performed on the system (5) Perform mount
re-exec - (1) Find common prefix amongst the lowerdir (2) Make lowerdir paths relative to common prefix (3) Execute subprocess (4) Chdir to common directory (5) Perform mount in subprocess

I ordered those from what I think is the easiest/best option to the worst. I think reexec is the worst option and would like to avoid it if possible (although I did see some code somewhere that was able to successfully perform a real fork in Go). The second option is a bit confusing, but it would allow the deepest support for overlay by leveraging overlay's support for having an overlay mount in the lowerdir, but NOT if that overlay mount itself has an overlay mount in its lowerdir, hence the 2 level max. The first option is the fastest and easiest, and the only additional system call required is symlink, which is much faster than mounting or execing.

fuweid · 2018-07-26T10:44:30Z

@dmcgowan Thank you for the detailed explanation!

Both Symlink and 2 layer mount are better than re-exec. But they also add the burden of cleaning up the tmpdir or symlinks after the mount. On this point, I prefer re-exec.

However, we don't need to chdir all the time. We can skip compacting the lowerdir option if the size of the options in bytes doesn't hit the limit. I opened PR #2502 for this. What do you think?
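The size check described above might look like the following sketch. The names optionsSize and needsCompaction are illustrative and are not claimed to match PR #2502; the limit is passed in as a parameter, and a caller would typically pass os.Getpagesize().

```go
package main

import (
	"fmt"
	"os"
)

// optionsSize returns the number of bytes the options occupy when joined
// with commas, which is how they are passed as mount data.
func optionsSize(options []string) int {
	size := 0
	for _, opt := range options {
		size += len(opt)
	}
	if n := len(options); n > 1 {
		size += n - 1 // comma separators
	}
	return size
}

// needsCompaction reports whether the options reach the given limit.
// Using >= leaves room for the terminating NUL byte.
func needsCompaction(options []string, limit int) bool {
	return optionsSize(options) >= limit
}

func main() {
	opts := []string{"lowerdir=209/fs:208/fs", "upperdir=210/fs", "workdir=210/work"}
	fmt.Println(optionsSize(opts), needsCompaction(opts, os.Getpagesize()))
}
```

Only when the check returns true would the mount code need to shorten the lowerdir paths; small images keep the ordinary mount path.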

dmcgowan · 2018-07-26T20:06:15Z

@fuweid using Docker's re-exec is not an option. We don't pull in pkg from docker/docker/pkg. That being said, that re-exec approach is not really a good design and mostly working around a difficulty in Go around supporting fork. Gvisor seemed to accomplish this though in https://github.com/google/gvisor/tree/797cda301677abc8523d5a2a8d731312cc43bce4/pkg/sentry/platform.

But they also add the burden of cleaning up the tmpdir or symlinks after the mount.

The mount package would also be responsible for taking care of this

fuweid · 2018-07-27T01:38:16Z

@dmcgowan I will try the Gvisor way. Thank you!

ijc · 2018-07-27T09:53:01Z

Isn't the need for re-exec pretty much eliminated by the changes to runtime.LockOSThread in Go 1.10? See the second paragraph of Runtime in the go1.10 release notes.

It's now safe to runtime.LockOSThread and then Chdir or flip namespace as much as you like; that OS thread is now tainted and will never be used for another goroutine after the current one exits.

dmcgowan · 2018-07-27T16:50:56Z

It's now safe to runtime.LockOSThread and then Chdir

There is still no way to prevent the os.Chdir from having an effect on the whole process. However if there is a way to call http://man7.org/linux/man-pages/man2/clone.2.html without CLONE_FS, that is probably the best option.

fuweid · 2018-08-10T08:15:07Z

Thanks @dmcgowan , @ijc and @AkihiroSuda for the help!

Atry · 2024-10-28T18:04:23Z

According to my test, there is still a limit of around 260 layers.

fuweid changed the title ~~[bug & proposal] support more-layers-image by overlayfs~~ [bug & proposal] overlayfs support more-layers-image Jul 25, 2018

fuweid mentioned this issue Jul 26, 2018

support more overlayfs layers #2502

Merged

crosbymichael closed this as completed Aug 9, 2018
