Bug
Description
The number of layers is limited by the maximum size of the mount option buffer in the kernel (generally 1 page, 4096 bytes). For now, containerd passes the absolute paths of the snapshotter's directories. The root path is typically /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots, which is 68 characters long. As a result, we cannot pull an image that has about 60 layers (~ 4096/68).
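To make the arithmetic above concrete, here is a rough sketch that builds a lowerdir option from absolute paths and counts how many layers fit in one page. The helper names (lowerdirOption, maxLayers) are hypothetical, sequential snapshot IDs are assumed, and the upperdir/workdir options are ignored, so the real limit is somewhat lower.

```go
package main

import (
	"fmt"
	"strings"
)

// pageSize is the assumed size of the kernel's mount option buffer (one page).
const pageSize = 4096

// lowerdirOption builds an overlayfs "lowerdir=" option for n layers, with
// one "<root>/<id>/fs" entry per layer (hypothetical sequential IDs 1..n).
func lowerdirOption(root string, n int) string {
	dirs := make([]string, 0, n)
	for id := 1; id <= n; id++ {
		dirs = append(dirs, fmt.Sprintf("%s/%d/fs", root, id))
	}
	return "lowerdir=" + strings.Join(dirs, ":")
}

// maxLayers returns the largest layer count whose lowerdir option is still
// shorter than pageSize. upperdir and workdir are not counted here.
func maxLayers(root string) int {
	n := 0
	for len(lowerdirOption(root, n+1)) < pageSize {
		n++
	}
	return n
}

func main() {
	root := "/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots"
	fmt.Println("root length:", len(root))
	fmt.Println("layers that fit:", maxLayers(root))
}
```

Each entry costs the 68-byte root plus the ID, the "/fs" suffix and a separator, which is why the estimate lands a little below 4096/68.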
Reproduce
FROM busybox:latest
# repeat the following line 60 times
RUN mktemp
Run ctr pull ${imageName} and you will get:
INFO[0001] apply failure, attempting cleanup error="failed to extract layer sha256:0f87b58539d6201873662f4d4c1c7f4097367770530b6451df54f3539231b26e: failed to mount /var/lib/containerd/tmpmounts/containerd-mount550654010: no such file or directory: unknown" key="extract-982113456-BiJs sha256:a4de2c26d7e33d26f51b87a4637f312edc3da2fc2f566f9485c774d8549e4d71"
ctr: failed to extract layer sha256:0f87b58539d6201873662f4d4c1c7f4097367770530b6451df54f3539231b26e: failed to mount /var/lib/containerd/tmpmounts/containerd-mount550654010: no such file or directory: unknown
containerd version
➜ containerd git:(master) containerd --version
containerd github.com/containerd/containerd v1.1.0-196-g0d52c71 0d52c71c805ec8b4e1d683e919d6914845067794
Proposal
I have two proposals for this issue:
symlink the snapshots dir to a short path
Create a short link to the snapshots dir, like /tmp/ctrdl, to compact the lowerdir option. However, a link under /tmp is out of our control: it can be cleaned up by tools such as tmpwatch. It is also hard to maintain the link across start/stop/start cycles.
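A minimal sketch of the symlink idea, assuming a hypothetical helper shortLowerdirs that swaps the long root prefix for the link path. The demo uses a temporary directory instead of the real containerd root, and it does not address the cleanup and lifecycle problems mentioned above.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// shortLowerdirs rewrites every path under root so that it goes through
// link, a short symlink pointing at root. Other paths are kept unchanged.
func shortLowerdirs(root, link string, lowerdirs []string) []string {
	out := make([]string, len(lowerdirs))
	for i, dir := range lowerdirs {
		rel, err := filepath.Rel(root, dir)
		if err == nil && !strings.HasPrefix(rel, "..") {
			out[i] = filepath.Join(link, rel)
		} else {
			out[i] = dir // not under root, keep the original path
		}
	}
	return out
}

func main() {
	tmp, err := os.MkdirTemp("", "ctrd-demo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(tmp)

	// Stand-in for the long snapshots directory.
	root := filepath.Join(tmp, "io.containerd.snapshotter.v1.overlayfs", "snapshots")
	if err := os.MkdirAll(filepath.Join(root, "1", "fs"), 0o755); err != nil {
		panic(err)
	}
	// Short link that replaces the long prefix in the mount options.
	link := filepath.Join(tmp, "l")
	if err := os.Symlink(root, link); err != nil {
		panic(err)
	}

	dirs := shortLowerdirs(root, link, []string{filepath.Join(root, "1", "fs")})
	_, statErr := os.Stat(dirs[0])
	fmt.Println(dirs[0], "resolves:", statErr == nil)
}
```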
use reexec to change work dir before mount
Like moby, use reexec to fork a process that performs the mount. Since the snapshotter service only provides the mount options and performs no mount itself, I want to change the behaviour of github.com/containerd/containerd/mount, like
// https://github.com/containerd/containerd/blob/master/mount/mount.go
// Mount is the lingua franca of containerd. A mount represents a
// serialized mount syscall. Components either emit or consume mounts.
type Mount struct {
+ // Chdir allows to change working directory to avoid to exceed max of
+ // the mount option buffer (page size) in kernel.
+ Chdir string
// Type specifies the host-specific of the mount.
Type string
// Source specifies where to mount from. Depending on the host system, this
func (m *Mount) Mount will use reexec if Chdir is not empty. For overlayfs, containerd will change the working directory to snapshots and mount the layers with relative paths, like
➜ containerd git:(proposal) ctr run --rm -t localhost:5000/t:latest sh
overlay on /run/containerd/io.containerd.runtime.v1.linux/default/sh/rootfs type overlay (rw,relatime,lowerdir=209/fs:208/fs:207/fs:206/fs:205/fs:204/fs:203/fs:202/fs:201/fs:200/fs:199/fs:198/fs:197/fs:196/fs:195/fs:194/fs:193/fs:192/fs:191/fs:190/fs:189/fs:188/fs:187/fs:186/fs:185/fs:184/fs:183/fs:182/fs:181/fs:180/fs:179/fs:178/fs:177/fs:176/fs:175/fs:174/fs:173/fs:172/fs:171/fs:170/fs:169/fs:168/fs:167/fs:166/fs:165/fs:164/fs:163/fs:162/fs:161/fs:160/fs:159/fs:158/fs:157/fs:156/fs:155/fs:154/fs:153/fs,upperdir=210/fs,workdir=210/work)
A snapshot ID takes at most 20 digits, so each lowerdir entry (the ID, the /fs suffix and the ':' separator) is at most 24 bytes. That lets overlayfs support more than 128 layers: 128 × 24 = 3072 bytes, which is under the 4096-byte limit.
Since the snapshot service, task service and runtime/shim service all consume the mount options, this proposal will change the proto file, too.
It seems that reexec can handle more layers in overlayfs well. However, it needs to change the API and introduce reexec behaviour like moby. Does it make sense?
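The path rewriting behind the Chdir proposal could look roughly like the sketch below. The relativize helper is a hypothetical name, not containerd's API. It only shortens the option strings; the actual chdir and mount would still have to run in a separate (reexec'd) process, because changing the working directory affects the whole process.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// relativize rewrites the lowerdir, upperdir and workdir paths of overlayfs
// mount options so that they are relative to chdir. Options without paths,
// and paths outside chdir, are returned unchanged.
func relativize(chdir string, options []string) []string {
	out := make([]string, 0, len(options))
	for _, opt := range options {
		kv := strings.SplitN(opt, "=", 2)
		if len(kv) != 2 || (kv[0] != "lowerdir" && kv[0] != "upperdir" && kv[0] != "workdir") {
			out = append(out, opt)
			continue
		}
		dirs := strings.Split(kv[1], ":")
		for i, dir := range dirs {
			rel, err := filepath.Rel(chdir, dir)
			if err == nil && !strings.HasPrefix(rel, "..") {
				dirs[i] = rel
			}
		}
		out = append(out, kv[0]+"="+strings.Join(dirs, ":"))
	}
	return out
}

func main() {
	root := "/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots"
	opts := []string{
		"lowerdir=" + root + "/2/fs:" + root + "/1/fs",
		"upperdir=" + root + "/3/fs",
		"workdir=" + root + "/3/work",
	}
	before := len(strings.Join(opts, ","))
	after := len(strings.Join(relativize(root, opts), ","))
	fmt.Println("option bytes before:", before, "after:", after)
}
```

With relative paths the per-layer cost no longer depends on the length of the snapshots root, only on the snapshot ID.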
ping @dmcgowan and @AkihiroSuda