Skip to content

Containers slow to start after init container finishes: failed to handle container TaskExit event #8698

@alam0rt

Description

@alam0rt

Description

Note that this only is a problem when the node has just come up. After running for a few minutes or so, everything is fine.

I noticed that our aws-cni pods were taking a long time to run after starting. The init container runs and complete within a second, but the main container sometimes takes over a minute before it began execution.

Looking at the kubelet logs, I saw the Relisting method (which usually runs once per second) take well over 30

kubelet_init[1680]: I0602 05:48:06.350695    1679 generic.go:191] "GenericPLEG: Relisting"
kubelet_init[1680]: I0602 05:48:07.352436    1679 generic.go:191] "GenericPLEG: Relisting"
kubelet_init[1680]: I0602 05:48:49.669827    1679 generic.go:191] "GenericPLEG: Relisting"
kubelet_init[1680]: I0602 05:48:50.672822    1679 generic.go:191] "GenericPLEG: Relisting"
kubelet_init[1680]: I0602 05:48:51.676857    1679 generic.go:191] "GenericPLEG: Relisting"

This, to the best of my knowledge calls the CRI of containerd for a list of all the running pods.

Looking into containerd's logs I saw signs that containerd or the shim was having trouble stopping the container

containerd[2206]: {"error":"failed to stop container: failed to delete task: context deadline exceeded:
 unknown","level":"error","msg":"failed to handle container TaskExit event \u0026TaskExit{ContainerID:8f71eadd772ffde7f9537998844aa3f7a8a4f15ec182a5943ef0d60f242b1929,ID:8f71eadd772ffde7f9537998844aa3f7a8a4f15ec182a5943ef0d60f242b1929,Pid:3658,ExitStatus:0,ExitedAt:2023-06-15 06:39:36.566238047 +0000 UTC,XXX_unrecognized:[],}","time":"2023-06-15T06:39:46.568049942Z"}

In the goroutine dump, I can see (L677) that a goroutine is waiting on the the Delete call to dispatch.

containerd-goroutines

In the containerd-shim-runc-v2 goroutines, it appears to be hanging on the umount syscall, not sure though.

goroutine 104 [syscall]:
syscall.Syscall(0x7c?, 0xc000161800?, 0xc0000c2200?, 0x7d?)
        /usr/local/go/src/syscall/syscall_linux.go:68 +0x27
golang.org/x/sys/unix.Unmount({0xc0000c2200?, 0x0?}, 0xc00029c448?)
        /go/src/github.com/containerd/containerd/vendor/golang.org/x/sys/unix/zsyscall_linux.go:1680 +0x7d
github.com/containerd/containerd/mount.unmount({0xc0000c2200, 0x7c}, 0x9ed280?)
        /go/src/github.com/containerd/containerd/mount/mount_linux.go:155 +0x8c
github.com/containerd/containerd/mount.UnmountAll({0xc0000c2200, 0x7c}, 0xc000065080?)
        /go/src/github.com/containerd/containerd/mount/mount_linux.go:185 +0x8c
github.com/containerd/containerd/pkg/process.(*Init).delete(0xc0000d2c60, {0x9f1460, 0xc000065080})
        /go/src/github.com/containerd/containerd/pkg/process/init.go:312 +0x13a
github.com/containerd/containerd/pkg/process.(*stoppedState).Delete(0xc000012ab8, {0x9f1460?, 0xc000065080?})
        /go/src/github.com/containerd/containerd/pkg/process/init_state.go:396 +0x2e
github.com/containerd/containerd/pkg/process.(*Init).Delete(0xc00002d180?, {0x9f1460?, 0xc000065080?})
        /go/src/github.com/containerd/containerd/pkg/process/init.go:288 +0xb6
github.com/containerd/containerd/runtime/v2/runc.(*Container).Delete(0xc000132380?, {0x9f1460, 0xc000065080}, 0xc0003071c0)
        /go/src/github.com/containerd/containerd/runtime/v2/runc/container.go:390 +0x62
github.com/containerd/containerd/runtime/v2/runc/task.(*service).Delete(0xc000132380, {0x9f1460, 0xc000065080}, 0xc0003071c0)
        /go/src/github.com/containerd/containerd/runtime/v2/runc/task/service.go:216 +0x75
github.com/containerd/containerd/runtime/v2/task.RegisterTaskService.func4({0x9f1460, 0xc000065080}, 0xc00002ee00)
        /go/src/github.com/containerd/containerd/runtime/v2/task/shim.pb.go:3477 +0x98
github.com/containerd/ttrpc.defaultServerInterceptor({0x9f1460?, 0xc000065080?}, 0x17?, 0xc000134760?, 0x6?)
        /go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/interceptor.go:45 +0x26
github.com/containerd/ttrpc.(*serviceSet).dispatch(0xc000118650, {0x9f1460, 0xc000065080}, {0xc00002aba0, 0x17}, {0xc00027e22a, 0x6}, {0xc000180280, 0x42, 0x50})
        /go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/services.go:95 +0x1be
github.com/containerd/ttrpc.(*serviceSet).call(0xd0e330?, {0x9f1460?, 0xc000065080?}, {0xc00002aba0?, 0x84eac0?}, {0xc00027e22a?, 0xc000308000?}, {0xc000180280, 0x42, 0x50})
        /go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/services.go:64 +0x71
github.com/containerd/ttrpc.(*serverConn).run.func2(0x1b)
        /go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:439 +0xe5
created by github.com/containerd/ttrpc.(*serverConn).run
        /go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:435 +0x6b1

shim-goroutines (each shim goroutine is seperated by a BREAK)

Steps to reproduce the issue

Unfortunately, I don't know how, but this occurs across all nodes. I've tried different versions of runc, using systemd over cgroupfs, same issue.

We were also restarting containerd while booting (as we reconfigure some things at boot time), but even after removing the reboot, same behaviour.

Describe the results you received and expected

The main container should start pretty much straight after the init container finishes.

Kubelet PLEG relisting should not block for too long, nor should I get.

E0602 05:48:48.669179    1679 kubelet.go:2190] "Housekeeping took longer than 15s" err="housekeeping took too long" seconds=41.340653962

What version of containerd are you using?

containerd github.com/containerd/containerd v1.6.21 3dce8eb

Any other relevant information

runc version 1.1.7
commit: v1.1.7-0-g860f061b
spec: 1.0.2-dev
go: go1.20.3
libseccomp: 2.5.4
5.15.0-1037-aws #41~20.04.1-Ubuntu

Let me know what details you want from crictl info and I can provide (don't want to paste entire thing)

Show configuration if it is related to CRI plugin.

(I have used multiple configurations, below is the EKS config)

version = 2
root = "/var/lib/containerd"
state = "/run/containerd"

[grpc]
address = "/run/containerd/containerd.sock"

[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "runc"

[plugins."io.containerd.grpc.v1.cri".registry]
config_path = "/etc/containerd/certs.d:/etc/docker/certs.d"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true

[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"

Metadata

Metadata

Assignees

No one assigned

    Type

    Projects

    Status

    Todo

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions