-
Notifications
You must be signed in to change notification settings - Fork 3.8k
Description
Description
Note that this only is a problem when the node has just come up. After running for a few minutes or so, everything is fine.
I noticed that our aws-cni pods were taking a long time to run after starting. The init container runs and complete within a second, but the main container sometimes takes over a minute before it began execution.
Looking at the kubelet logs, I saw the Relisting method (which usually runs once per second) take well over 30
kubelet_init[1680]: I0602 05:48:06.350695 1679 generic.go:191] "GenericPLEG: Relisting"
kubelet_init[1680]: I0602 05:48:07.352436 1679 generic.go:191] "GenericPLEG: Relisting"
kubelet_init[1680]: I0602 05:48:49.669827 1679 generic.go:191] "GenericPLEG: Relisting"
kubelet_init[1680]: I0602 05:48:50.672822 1679 generic.go:191] "GenericPLEG: Relisting"
kubelet_init[1680]: I0602 05:48:51.676857 1679 generic.go:191] "GenericPLEG: Relisting"
This, to the best of my knowledge calls the CRI of containerd for a list of all the running pods.
Looking into containerd's logs I saw signs that containerd or the shim was having trouble stopping the container
containerd[2206]: {"error":"failed to stop container: failed to delete task: context deadline exceeded:
unknown","level":"error","msg":"failed to handle container TaskExit event \u0026TaskExit{ContainerID:8f71eadd772ffde7f9537998844aa3f7a8a4f15ec182a5943ef0d60f242b1929,ID:8f71eadd772ffde7f9537998844aa3f7a8a4f15ec182a5943ef0d60f242b1929,Pid:3658,ExitStatus:0,ExitedAt:2023-06-15 06:39:36.566238047 +0000 UTC,XXX_unrecognized:[],}","time":"2023-06-15T06:39:46.568049942Z"}
In the goroutine dump, I can see (L677) that a goroutine is waiting on the the Delete call to dispatch.
In the containerd-shim-runc-v2 goroutines, it appears to be hanging on the umount syscall, not sure though.
goroutine 104 [syscall]:
syscall.Syscall(0x7c?, 0xc000161800?, 0xc0000c2200?, 0x7d?)
/usr/local/go/src/syscall/syscall_linux.go:68 +0x27
golang.org/x/sys/unix.Unmount({0xc0000c2200?, 0x0?}, 0xc00029c448?)
/go/src/github.com/containerd/containerd/vendor/golang.org/x/sys/unix/zsyscall_linux.go:1680 +0x7d
github.com/containerd/containerd/mount.unmount({0xc0000c2200, 0x7c}, 0x9ed280?)
/go/src/github.com/containerd/containerd/mount/mount_linux.go:155 +0x8c
github.com/containerd/containerd/mount.UnmountAll({0xc0000c2200, 0x7c}, 0xc000065080?)
/go/src/github.com/containerd/containerd/mount/mount_linux.go:185 +0x8c
github.com/containerd/containerd/pkg/process.(*Init).delete(0xc0000d2c60, {0x9f1460, 0xc000065080})
/go/src/github.com/containerd/containerd/pkg/process/init.go:312 +0x13a
github.com/containerd/containerd/pkg/process.(*stoppedState).Delete(0xc000012ab8, {0x9f1460?, 0xc000065080?})
/go/src/github.com/containerd/containerd/pkg/process/init_state.go:396 +0x2e
github.com/containerd/containerd/pkg/process.(*Init).Delete(0xc00002d180?, {0x9f1460?, 0xc000065080?})
/go/src/github.com/containerd/containerd/pkg/process/init.go:288 +0xb6
github.com/containerd/containerd/runtime/v2/runc.(*Container).Delete(0xc000132380?, {0x9f1460, 0xc000065080}, 0xc0003071c0)
/go/src/github.com/containerd/containerd/runtime/v2/runc/container.go:390 +0x62
github.com/containerd/containerd/runtime/v2/runc/task.(*service).Delete(0xc000132380, {0x9f1460, 0xc000065080}, 0xc0003071c0)
/go/src/github.com/containerd/containerd/runtime/v2/runc/task/service.go:216 +0x75
github.com/containerd/containerd/runtime/v2/task.RegisterTaskService.func4({0x9f1460, 0xc000065080}, 0xc00002ee00)
/go/src/github.com/containerd/containerd/runtime/v2/task/shim.pb.go:3477 +0x98
github.com/containerd/ttrpc.defaultServerInterceptor({0x9f1460?, 0xc000065080?}, 0x17?, 0xc000134760?, 0x6?)
/go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/interceptor.go:45 +0x26
github.com/containerd/ttrpc.(*serviceSet).dispatch(0xc000118650, {0x9f1460, 0xc000065080}, {0xc00002aba0, 0x17}, {0xc00027e22a, 0x6}, {0xc000180280, 0x42, 0x50})
/go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/services.go:95 +0x1be
github.com/containerd/ttrpc.(*serviceSet).call(0xd0e330?, {0x9f1460?, 0xc000065080?}, {0xc00002aba0?, 0x84eac0?}, {0xc00027e22a?, 0xc000308000?}, {0xc000180280, 0x42, 0x50})
/go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/services.go:64 +0x71
github.com/containerd/ttrpc.(*serverConn).run.func2(0x1b)
/go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:439 +0xe5
created by github.com/containerd/ttrpc.(*serverConn).run
/go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:435 +0x6b1
shim-goroutines (each shim goroutine is seperated by a BREAK)
Steps to reproduce the issue
Unfortunately, I don't know how, but this occurs across all nodes. I've tried different versions of runc, using systemd over cgroupfs, same issue.
We were also restarting containerd while booting (as we reconfigure some things at boot time), but even after removing the reboot, same behaviour.
Describe the results you received and expected
The main container should start pretty much straight after the init container finishes.
Kubelet PLEG relisting should not block for too long, nor should I get.
E0602 05:48:48.669179 1679 kubelet.go:2190] "Housekeeping took longer than 15s" err="housekeeping took too long" seconds=41.340653962
What version of containerd are you using?
containerd github.com/containerd/containerd v1.6.21 3dce8eb
Any other relevant information
runc version 1.1.7
commit: v1.1.7-0-g860f061b
spec: 1.0.2-dev
go: go1.20.3
libseccomp: 2.5.4
5.15.0-1037-aws #41~20.04.1-Ubuntu
Let me know what details you want from crictl info and I can provide (don't want to paste entire thing)
Show configuration if it is related to CRI plugin.
(I have used multiple configurations, below is the EKS config)
version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
[grpc]
address = "/run/containerd/containerd.sock"
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "runc"
[plugins."io.containerd.grpc.v1.cri".registry]
config_path = "/etc/containerd/certs.d:/etc/docker/certs.d"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"Metadata
Metadata
Assignees
Labels
Type
Projects
Status