[bug & proposal] overlayfs support more-layers-image #2497

Closed
fuweid opened this issue Jul 25, 2018 · 14 comments

@fuweid
Member

fuweid commented Jul 25, 2018

Bug

Description

The number of layers is limited by the maximum size of the kernel's mount option buffer (one page, generally 4096 bytes). For now, containerd uses the absolute path of each snapshot in the snapshotter. The root path looks like /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots (68 characters), so we cannot pull an image that has around 60 layers (~ 4096/68).
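The arithmetic behind that estimate can be sketched as follows. This is a rough upper bound only: the helper name maxLowerdirs is hypothetical, and it ignores the bytes taken by "lowerdir=", upperdir and workdir.

```go
package main

// overlayRoot is the snapshot root quoted in the issue (68 bytes long).
const overlayRoot = "/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots"

// maxLowerdirs estimates how many lowerdir entries fit into a mount option
// buffer of bufSize bytes when every entry has the form <prefix><id>/fs and
// entries are joined with ":". It is an upper bound, not an exact limit,
// because the other mount options are not counted.
func maxLowerdirs(bufSize int, prefix string, idDigits int) int {
	entry := len(prefix) + idDigits + len("/fs") + len(":")
	return bufSize / entry
}
```

With absolute paths and 3-digit snapshot IDs, maxLowerdirs(4096, overlayRoot+"/", 3) gives 53. With paths relative to the snapshots dir and worst-case 20-digit IDs, maxLowerdirs(4096, "", 20) gives 170.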

Reproduce

FROM busybox:latest

# repeat the following line 60 times
RUN mktemp

ctr pull ${imageName} and you will get:

INFO[0001] apply failure, attempting cleanup             error="failed to extract layer sha256:0f87b58539d6201873662f4d4c1c7f4097367770530b6451df54f3539231b26e: failed to mount /var/lib/containerd/tmpmounts/containerd-mount550654010: no such file or directory: unknown" key="extract-982113456-BiJs sha256:a4de2c26d7e33d26f51b87a4637f312edc3da2fc2f566f9485c774d8549e4d71"
ctr: failed to extract layer sha256:0f87b58539d6201873662f4d4c1c7f4097367770530b6451df54f3539231b26e: failed to mount /var/lib/containerd/tmpmounts/containerd-mount550654010: no such file or directory: unknown

containerd version

➜  containerd git:(master) containerd --version
containerd github.com/containerd/containerd v1.1.0-196-g0d52c71 0d52c71c805ec8b4e1d683e919d6914845067794

Proposal

I have two proposals for this issue:

symlink the snapshots dir

Create a short symlink to the snapshots dir, like /tmp/ctrdl, to compact the lowerdir option. However, a link under /tmp is outside containerd's control; for example, tmpwatch may clean it up. It is also hard to maintain the link across start/stop/start...

use reexec to change work dir before mount

Like moby, use reexec to fork a process that performs the mount. Since the snapshotter service only provides the mount options and performs no mount itself, I want to change the behaviour of github.com/containerd/containerd/mount, like

// https://github.com/containerd/containerd/blob/master/mount/mount.go

// Mount is the lingua franca of containerd. A mount represents a
// serialized mount syscall. Components either emit or consume mounts.
 type Mount struct {
+       // Chdir changes the working directory before mounting so that the
+       // mount options do not exceed the kernel's option buffer (page size).
+       Chdir string
        // Type specifies the host-specific of the mount.
        Type string
        // Source specifies where to mount from. Depending on the host system, this

func (m *Mount) Mount will use reexec if Chdir is not empty. For overlayfs, containerd will change the working directory to the snapshots dir and mount the layers with relative paths, like

➜  containerd git:(proposal) ctr run --rm -t localhost:5000/t:latest sh

overlay on /run/containerd/io.containerd.runtime.v1.linux/default/sh/rootfs type overlay (rw,relatime,lowerdir=209/fs:208/fs:207/fs:206/fs:205/fs:204/fs:203/fs:202/fs:201/fs:200/fs:199/fs:198/fs:197/fs:196/fs:195/fs:194/fs:193/fs:192/fs:191/fs:190/fs:189/fs:188/fs:187/fs:186/fs:185/fs:184/fs:183/fs:182/fs:181/fs:180/fs:179/fs:178/fs:177/fs:176/fs:175/fs:174/fs:173/fs:172/fs:171/fs:170/fs:169/fs:168/fs:167/fs:166/fs:165/fs:164/fs:163/fs:162/fs:161/fs:160/fs:159/fs:158/fs:157/fs:156/fs:155/fs:154/fs:153/fs,upperdir=210/fs,workdir=210/work)

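A snapshot ID takes at most 20 digits, so each relative lowerdir entry (the ID plus "/fs" and the ":" separator) is at most 24 bytes. A 4096-byte buffer then has room for roughly 170 entries, so this can support more than 128 layers for overlayfs.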

Since the snapshot service, task service and runtime/shim service all consume the mount options, this proposal would change the proto file, too.

It seems that reexec can handle more layers in overlayfs well. However, it needs an API change and introduces reexec behaviour like moby. Does it make sense?

ping @dmcgowan and @AkihiroSuda

@fuweid fuweid changed the title [bug & proposal] support more-layers-image by overlayfs [bug & proposal] overlayfs support more-layers-image Jul 25, 2018
@AkihiroSuda
Member

Ideally we should reexec+chdir but I fear it breaks API compatibility.

Can we keep the current API and add an overlayfs-specific chdir hack to https://github.com/containerd/containerd/blob/master/mount/mount_linux.go ?

@fuweid
Member Author

fuweid commented Jul 25, 2018

Can we keep the current API and add an overlayfs-specific chdir hack to https://github.com/containerd/containerd/blob/master/mount/mount_linux.go ?

Agree with the API compatibility point.

I think we can use Type:overlay and Option:chdir opt to hack the chdir, like:

// https://github.com/containerd/containerd/blob/master/snapshots/overlay/overlay.go#L492
options = append(options, fmt.Sprintf("lowerdir=%s", strings.Join(parentPaths, ":")))
+ options = append(options, fmt.Sprintf("chdir=%s", filepath.Join(o.root, "snapshots")))

// https://github.com/containerd/containerd/blob/master/mount/mount_linux.go
// check whether the options contain chdir

Doable but hacky. 😄

@AkihiroSuda
Member

I think we can use Type:overlay and Option:chdir opt to hack the chdir, like:

That would make a containerd v1.1 client unable to talk to a containerd v1.2 daemon.
Can we let mount_linux.go auto-detect chdir path from options values?
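One way such auto-detection could look is sketched below. It is an illustration only: compactLowerdir and commonDir are hypothetical names, not functions in containerd's mount package, and only the lowerdir option is rewritten (upperdir and workdir stay as they are).

```go
package main

import (
	"path/filepath"
	"strings"
)

// compactLowerdir looks for a "lowerdir=" option, finds the deepest directory
// shared by all of its paths and rewrites the paths relative to it. It returns
// the rewritten options and the directory to chdir into before mounting.
// When there is nothing to compact, the input is returned with chdir == "".
func compactLowerdir(options []string) (out []string, chdir string) {
	const key = "lowerdir="
	idx := -1
	for i, o := range options {
		if strings.HasPrefix(o, key) {
			idx = i
			break
		}
	}
	if idx < 0 {
		return options, ""
	}
	dirs := strings.Split(strings.TrimPrefix(options[idx], key), ":")
	if len(dirs) < 2 {
		return options, ""
	}
	common := commonDir(dirs)
	if common == "" {
		return options, ""
	}
	rel := make([]string, len(dirs))
	for i, d := range dirs {
		r, err := filepath.Rel(common, d)
		if err != nil {
			return options, ""
		}
		rel[i] = r
	}
	// Copy so that the caller's slice is not modified.
	out = append([]string{}, options...)
	out[idx] = key + strings.Join(rel, ":")
	return out, common
}

// commonDir returns the deepest directory that contains every path,
// or "" when the paths share nothing below the filesystem root.
func commonDir(paths []string) string {
	parts := strings.Split(filepath.Clean(paths[0]), "/")
	for _, p := range paths[1:] {
		cur := strings.Split(filepath.Clean(p), "/")
		n := 0
		for n < len(parts) && n < len(cur) && parts[n] == cur[n] {
			n++
		}
		parts = parts[:n]
	}
	return strings.Join(parts, "/")
}
```

Because the mount options sent over the API stay absolute, a client that mounts by itself would keep working; only the code that performs the mount needs to know about the chdir.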

@ijc
Contributor

ijc commented Jul 25, 2018

I guess it'll be a while before it is generally available enough for us to rely on, but it seems there is work afoot in kernel-land which would remove the 4k limit to mount options: Six (or seven) new system calls for filesystem mounting.

@fuweid
Copy link
Member Author

fuweid commented Jul 25, 2018

That would make a containerd v1.1 client unable to talk to a containerd v1.2 daemon.

The client just sends the mount info to the diff.apply service, and the mount info comes from the snapshot service. I think a lower-version client can still access a higher-version daemon. Is that correct?

@AkihiroSuda
Copy link
Member

That's true if the client just lets the daemon call mount(2), but IIUC the client should also be allowed to call mount(2) by itself.

@dmcgowan
Member

Thanks for bringing this up; this was next on my 1.2 TODO list of issues to look at.

My plan was to implement this functionality entirely in the mount package, avoiding any API-facing changes or changes to the snapshotters. Any snapshotter, not just the included overlay snapshotter, should be able to return standard overlay mounts without having to add customized mount options. This puts the burden on the mount logic to figure out how to perform that mount based on the current system's limitations. I can think of 3 ways to do this, 2 of which you already mentioned.

  • Symlink - (1) Find a common prefix amongst the lowerdir values (2) Create a symlink in the /tmp directory to that location (3) Replace the common prefix in the lowerdir values (4) Perform mount
  • 2 layer mount - (1) Remove a number of paths from the end of lowerdir (2) Perform a read-only overlay mount in temp (3) Replace the paths included in that mount with the single mount (4) Repeat the process, excluding any directories which are already overlay since only 2 levels of overlay are allowed, until the mount can be performed on the system (5) Perform mount
  • re-exec - (1) Find a common prefix amongst the lowerdir values (2) Make lowerdir paths relative to the common prefix (3) Execute subprocess (4) Chdir to common directory (5) Perform mount in subprocess

I ordered those by which I think are the easiest/best options. I think re-exec is the worst option and would like to avoid it if possible (although I did see some code somewhere that was able to successfully perform a real fork in Go). The second option is a bit confusing, but it would allow the deepest support for overlay by leveraging overlay's support for having an overlay mount in the lowerdir, but NOT if that overlay mount also has an overlay mount in its own lowerdir, hence the 2 level max. The first option is the fastest and easiest, and the only additional system call required is symlink, which is much faster than mounting or execing.

@fuweid
Member Author

fuweid commented Jul 26, 2018

@dmcgowan Thank you for the detailed explanation!

Both Symlink and 2 layer mount are better than re-exec. But they also add the burden of cleaning up the tmpdir or symlinks after mount. On this point, I prefer re-exec.

However, we don't need to chdir all the time. We can skip compacting the lowerdir option if the size of the options in bytes doesn't hit the limit. I made PR #2502 for this. What do you think?

@dmcgowan
Member

@fuweid using Docker's re-exec is not an option. We don't pull in packages from docker/docker/pkg. That being said, the re-exec approach is not really a good design; it mostly works around a difficulty in Go around supporting fork. Gvisor seemed to accomplish this though in https://github.com/google/gvisor/tree/797cda301677abc8523d5a2a8d731312cc43bce4/pkg/sentry/platform.

But they also add the burden of cleaning up the tmpdir or symlinks after mount.

The mount package would also be responsible for taking care of this.

@fuweid
Member Author

fuweid commented Jul 27, 2018

@dmcgowan I will try the Gvisor way. Thank you!

@ijc
Contributor

ijc commented Jul 27, 2018

Isn't the need for re-exec pretty much eliminated by the changes to runtime.LockOSThread in Go 1.10? See the second paragraph of Runtime in the go1.10 release notes.

It's now safe to runtime.LockOSThread and then Chdir or flip namespaces as much as you like; that OS thread is now tainted and will never be used for another goroutine after the current one exits.

@dmcgowan
Member

It's now safe to runtime.LockOSThread and then Chdir

There is still no way to prevent the os.Chdir from having an effect on the whole process. However if there is a way to call http://man7.org/linux/man-pages/man2/clone.2.html without CLONE_FS, that is probably the best option.

@fuweid
Member Author

fuweid commented Aug 10, 2018

Thanks @dmcgowan, @ijc and @AkihiroSuda for the help!

@Atry

Atry commented Oct 28, 2024

According to my test, there is still a limit of around 260 layers.
