[bug & proposal] overlayfs support more-layers-image #2497

@fuweid

Bug

Description

The number of layers is limited by the maximum size of the mount option buffer in the kernel (one page, 4096 bytes in general). For now, containerd passes the absolute path of each snapshot in the mount options. The snapshot root path is typically /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots (68 characters), so we cannot pull an image which has about 60 layers (~ 4096/68).
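To make the estimate concrete, here is a rough sketch that builds a lowerdir option from absolute paths and measures its length. The helper names (lowerdirOption, sequentialIDs) are hypothetical and not part of containerd, the IDs are assumed to be small sequential numbers, and the upperdir/workdir options are ignored, so the real limit is somewhat lower than what this prints.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// snapshotsRoot is the default overlayfs snapshots dir quoted in this issue.
const snapshotsRoot = "/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots"

// sequentialIDs returns the IDs "1".."n" (an assumption for illustration;
// real snapshot IDs may be larger and therefore longer).
func sequentialIDs(n int) []string {
	ids := make([]string, n)
	for i := range ids {
		ids[i] = strconv.Itoa(i + 1)
	}
	return ids
}

// lowerdirOption builds "lowerdir=<root>/<id>/fs:<root>/<id>/fs:..."
// using absolute paths, as described above.
func lowerdirOption(root string, ids []string) string {
	dirs := make([]string, 0, len(ids))
	for _, id := range ids {
		dirs = append(dirs, root+"/"+id+"/fs")
	}
	return "lowerdir=" + strings.Join(dirs, ":")
}

func main() {
	const pageSize = 4096 // assumed page size
	for _, n := range []int{40, 50, 60} {
		opt := lowerdirOption(snapshotsRoot, sequentialIDs(n))
		fmt.Printf("%d layers: %d bytes, under one page: %v\n", n, len(opt), len(opt) < pageSize)
	}
}
```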

Reproduce

FROM busybox:latest

# repeat the following line 60 times
RUN mktemp

Run ctr pull ${imageName} and you will get:

INFO[0001] apply failure, attempting cleanup             error="failed to extract layer sha256:0f87b58539d6201873662f4d4c1c7f4097367770530b6451df54f3539231b26e: failed to mount /var/lib/containerd/tmpmounts/containerd-mount550654010: no such file or directory: unknown" key="extract-982113456-BiJs sha256:a4de2c26d7e33d26f51b87a4637f312edc3da2fc2f566f9485c774d8549e4d71"
ctr: failed to extract layer sha256:0f87b58539d6201873662f4d4c1c7f4097367770530b6451df54f3539231b26e: failed to mount /var/lib/containerd/tmpmounts/containerd-mount550654010: no such file or directory: unknown

containerd version

➜  containerd git:(master) containerd --version
containerd github.com/containerd/containerd v1.1.0-196-g0d52c71 0d52c71c805ec8b4e1d683e919d6914845067794

Proposal

I have two proposals for this issue:

symlink the snapshots dir to a shorter path

Create a short link to the snapshots dir, like /tmp/ctrdl, to compact the lowerdir option. However, a link under /tmp is outside containerd's control; for example, it could be cleaned up by tmpwatch. It's also hard to maintain the link across start/stop/start...

use reexec to change work dir before mount

Like moby, use reexec to fork a process that performs the mount. Since the snapshotter service only provides the mount options and does not perform the mount itself, I want to change the behaviour of github.com/containerd/containerd/mount, like

// https://github.com/containerd/containerd/blob/master/mount/mount.go

// Mount is the lingua franca of containerd. A mount represents a
// serialized mount syscall. Components either emit or consume mounts.
 type Mount struct {
+       // Chdir allows changing the working directory before mounting so that
+       // the mount options do not exceed the kernel's option buffer (page size).
+       Chdir string
        // Type specifies the host-specific of the mount.
        Type string
        // Source specifies where to mount from. Depending on the host system, this

func (m *Mount) Mount will use reexec if Chdir is not empty. For overlayfs, containerd will change the working directory to the snapshots dir and mount the layers with relative paths, like

➜  containerd git:(proposal) ctr run --rm -t localhost:5000/t:latest sh

overlay on /run/containerd/io.containerd.runtime.v1.linux/default/sh/rootfs type overlay (rw,relatime,lowerdir=209/fs:208/fs:207/fs:206/fs:205/fs:204/fs:203/fs:202/fs:201/fs:200/fs:199/fs:198/fs:197/fs:196/fs:195/fs:194/fs:193/fs:192/fs:191/fs:190/fs:189/fs:188/fs:187/fs:186/fs:185/fs:184/fs:183/fs:182/fs:181/fs:180/fs:179/fs:178/fs:177/fs:176/fs:175/fs:174/fs:173/fs:172/fs:171/fs:170/fs:169/fs:168/fs:167/fs:166/fs:165/fs:164/fs:163/fs:162/fs:161/fs:160/fs:159/fs:158/fs:157/fs:156/fs:155/fs:154/fs:153/fs,upperdir=210/fs,workdir=210/work)
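As a sketch of how the relative lowerdir entries above could be derived, the following rewrites absolute layer paths relative to a common root. compactLowerdirs is a hypothetical helper, not an existing containerd function, and it only prepares the option string. The chdir and the mount itself would still happen in the re-executed child, because changing the working directory affects the whole process.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// compactLowerdirs (hypothetical) rewrites absolute layer paths relative to
// root, so that the mount can be issued after changing directory into root.
// It fails if any path lies outside root.
func compactLowerdirs(root string, dirs []string) ([]string, error) {
	out := make([]string, 0, len(dirs))
	for _, d := range dirs {
		rel, err := filepath.Rel(root, d)
		if err != nil {
			return nil, err
		}
		if rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
			return nil, fmt.Errorf("%s is not under %s", d, root)
		}
		out = append(out, rel)
	}
	return out, nil
}

func main() {
	root := "/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots"
	dirs := []string{root + "/209/fs", root + "/208/fs", root + "/207/fs"}

	rel, err := compactLowerdirs(root, dirs)
	if err != nil {
		panic(err)
	}
	fmt.Println("chdir:", root) // value the proposed Chdir field would carry
	fmt.Println("lowerdir=" + strings.Join(rel, ":"))
}
```

Rejecting paths outside the root keeps the sketch from emitting ../ entries, which would save nothing and would make the option depend on the directory layout above the snapshots dir.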

A snapshot ID takes at most 20 digits, so with relative paths overlayfs can support more than 128 layers.
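The bound can be checked with a small back-of-the-envelope calculation. This sketch assumes that snapshot IDs are unsigned 64-bit integers (which gives the 20 digits), that the page is 4096 bytes, and that the only other options are upperdir and workdir in the form shown above. maxLayers is a hypothetical helper.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// maxLayers returns how many "<id>/fs" entries joined by ":" fit in budget
// bytes when every ID has idDigits digits (the worst case).
func maxLayers(budget, idDigits int) int {
	entry := idDigits + len("/fs")
	// n entries need n*entry bytes plus n-1 separators.
	return (budget + 1) / (entry + 1)
}

func main() {
	// Assumption: IDs are uint64, so the longest ID is MaxUint64 in decimal.
	digits := len(strconv.FormatUint(math.MaxUint64, 10))

	// Assumption: reserve room for "lowerdir=" and worst-case upperdir/workdir.
	overhead := len("lowerdir=") +
		len(",upperdir=") + digits + len("/fs") +
		len(",workdir=") + digits + len("/work")

	const pageSize = 4096 // assumed page size
	fmt.Println("digits per ID:", digits)
	fmt.Println("worst-case layers:", maxLayers(pageSize-overhead, digits))
}
```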

Since the snapshot service, task service and runtime/shim service all consume the mount options, this proposal will change the proto file, too.

It seems that reexec handles images with more layers on overlayfs well. However, it needs an API change and introduces reexec behaviour like moby. Does it make sense?

ping @dmcgowan and @AkihiroSuda
