Description
When migrating containerd from v2.0.0-rc.4 to v2.0.0-rc.5 on the nodes of a large Kubernetes cluster, many fatal errors like the following started to appear, causing containerd to crash:
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: fatal error: concurrent map writes
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: goroutine 164 [running]:
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: github.com/containerd/ttrpc.MD.Set(...)
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: /go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/metadata.go:48
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: github.com/containerd/otelttrpc.(*metadataSupplier).Set(0x0?, {0x5c66b02ac976, 0xb}, {0xc00087e640, 0x37})
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: /go/src/github.com/containerd/containerd/vendor/github.com/containerd/otelttrpc/metadata_supplier.go:58 +0x91
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: go.opentelemetry.io/otel/propagation.TraceContext.Inject({}, {0x5c66b0a036e8?, 0xc000a08a20?}, {0x5c66b09fb520, 0xc0008be0e0})
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: /go/src/github.com/containerd/containerd/vendor/go.opentelemetry.io/otel/propagation/trace_context.go:64 +0x747
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: go.opentelemetry.io/otel/propagation.compositeTextMapPropagator.Inject(...)
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: /go/src/github.com/containerd/containerd/vendor/go.opentelemetry.io/otel/propagation/propagation.go:106
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: go.opentelemetry.io/otel/internal/global.(*textMapPropagator).Inject(0xc0000ba460?, {0x5c66b0a036e8, 0xc000a08a20}, {0x5c66b09fb520, 0xc0008be0e0})
The concurrent map operations always come from the containerd/ttrpc dependency, and they occur in every possible combination: read/write, write/write, and iteration/write.
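For context, ttrpc.MD is a bare map[string][]string with no internal locking, so nothing prevents two goroutines from mutating the same instance at once. A simplified stand-in (not the real metadata.go code) showing the unsafe shape:

```go
package main

import "fmt"

// MD is a simplified stand-in for ttrpc.MD, which is a plain
// map[string][]string with no internal locking.
type MD map[string][]string

// Set writes straight into the map, like the call at metadata.go:48
// in the first trace above.
func (m MD) Set(key string, values ...string) {
	m[key] = values
}

func main() {
	md := MD{}
	md.Set("trace-id", "abc")
	fmt.Println(md["trace-id"]) // → [abc]

	// If two goroutines call md.Set at the same time (or one iterates
	// while another writes, as in the Kill/Wait traces), the Go runtime
	// aborts with exactly the fatal errors shown above; `go run -race`
	// flags the same access pattern as a data race.
}
```

Note that "fatal error: concurrent map writes" is a runtime throw, not a panic, so it cannot be recovered; the only fix is to synchronize access or stop sharing the map.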
Steps to reproduce the issue
I could not find a simple reproducer, since the issue stems from concurrent operations that happen to run at the same time. On my side, deploying v2.0.0-rc.5 on a ~30-node cluster with ~10 pods per node consistently triggers it (though this likely depends on many other factors): at any given time, around 2 nodes are failing because of these containerd crashes.
Describe the results you received and expected
Here are a few examples:
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: fatal error: concurrent map writes
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: goroutine 164 [running]:
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: github.com/containerd/ttrpc.MD.Set(...)
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: /go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/metadata.go:48
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: github.com/containerd/otelttrpc.(*metadataSupplier).Set(0x0?, {0x5c66b02ac976, 0xb}, {0xc00087e640, 0x37})
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: /go/src/github.com/containerd/containerd/vendor/github.com/containerd/otelttrpc/metadata_supplier.go:58 +0x91
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: go.opentelemetry.io/otel/propagation.TraceContext.Inject({}, {0x5c66b0a036e8?, 0xc000a08a20?}, {0x5c66b09fb520, 0xc0008be0e0})
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: /go/src/github.com/containerd/containerd/vendor/go.opentelemetry.io/otel/propagation/trace_context.go:64 +0x747
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: go.opentelemetry.io/otel/propagation.compositeTextMapPropagator.Inject(...)
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: /go/src/github.com/containerd/containerd/vendor/go.opentelemetry.io/otel/propagation/propagation.go:106
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: go.opentelemetry.io/otel/internal/global.(*textMapPropagator).Inject(0xc0000ba460?, {0x5c66b0a036e8, 0xc000a08a20}, {0x5c66b09fb520, 0xc0008be0e0})
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: fatal error: concurrent map iteration and map write
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: goroutine 20431 [running]:
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: github.com/containerd/ttrpc.MD.setRequest(...)
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: /go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/metadata.go:66
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: github.com/containerd/ttrpc.(*Client).Call(0xc000619290, {0x62c01be746e8, 0xc0013eee70}, {0x62c01b7364d4, 0x17}, {0x62c01b711e19, 0x4}, {0x62c01bcfa7a0?, 0xc0010c61e0?}, {0x62c01bcfa860, ...})
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: /go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/client.go:163 +0x1a5
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: github.com/containerd/containerd/api/runtime/task/v3.(*ttrpctaskClient).Wait(0xc000122ac0, {0x62c01be746e8, 0xc0013eee70}, 0xc0010c61e0)
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: /go/src/github.com/containerd/containerd/vendor/github.com/containerd/containerd/api/runtime/task/v3/shim_ttrpc.pb.go:273 +0x92
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: github.com/containerd/containerd/v2/core/runtime/v2.(*process).Wait(0xc0014c6e88, {0x62c01be746e8, 0xc0013eee70})
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: /go/src/github.com/containerd/containerd/core/runtime/v2/process.go:133 +0xbf
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: github.com/containerd/containerd/v2/plugins/services/tasks.(*local).Wait(0x62c01be74720?, {0x62c01be746e8, 0xc0013eee70}, 0xc000fa54a0, {0xc00132dfb0?, 0x2?, 0x62c01cc60250?})
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: /go/src/github.com/containerd/containerd/plugins/services/tasks/local.go:633 +0xde
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: github.com/containerd/containerd/v2/client.(*process).Wait.func1()
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: /go/src/github.com/containerd/containerd/client/process.go:175 +0x27d
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: created by github.com/containerd/containerd/v2/client.(*process).Wait in goroutine 20426
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: /go/src/github.com/containerd/containerd/client/process.go:168 +0xa5
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: goroutine 1 [chan receive, 17 minutes]:
Dec 10 13:30:57 <HOSTNAME> containerd[837]: fatal error: concurrent map iteration and map write
Dec 10 13:30:57 <HOSTNAME> containerd[837]: goroutine 28835 [running]:
Dec 10 13:30:57 <HOSTNAME> containerd[837]: github.com/containerd/otelttrpc.inject({0x55d4b49606e8, 0xc0013bc180}, {0x55d4b4958100, 0xc0001b5ad0}, 0xc000f14100)
Dec 10 13:30:57 <HOSTNAME> containerd[837]: /go/src/github.com/containerd/containerd/vendor/github.com/containerd/otelttrpc/metadata_supplier.go:86 +0x2e5
Dec 10 13:30:57 <HOSTNAME> containerd[837]: github.com/containerd/otelttrpc.UnaryClientInterceptor.func1({0x55d4b49606e8, 0xc000e86300}, 0xc000f14100, 0xc00011c640, 0x55d4b47e39e0?, 0xc00187c020)
Dec 10 13:30:57 <HOSTNAME> containerd[837]: /go/src/github.com/containerd/containerd/vendor/github.com/containerd/otelttrpc/interceptor.go:98 +0x285
Dec 10 13:30:57 <HOSTNAME> containerd[837]: github.com/containerd/ttrpc.(*Client).Call(0xc0008f2000, {0x55d4b49606e8, 0xc000e86300}, {0x55d4b42224d4, 0x17}, {0x55d4b41fdedd, 0x4}, {0x55d4b482e5c0?, 0xc00011c5f0?}, {0x55d4b4775020, ...})
Dec 10 13:30:57 <HOSTNAME> containerd[837]: /go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/client.go:173 +0x323
Dec 10 13:30:57 <HOSTNAME> containerd[837]: github.com/containerd/containerd/api/runtime/task/v3.(*ttrpctaskClient).Kill(0xc000ea4030, {0x55d4b49606e8, 0xc000e86300}, 0xc00011c5f0)
Dec 10 13:30:57 <HOSTNAME> containerd[837]: /go/src/github.com/containerd/containerd/vendor/github.com/containerd/containerd/api/runtime/task/v3/shim_ttrpc.pb.go:233 +0x92
Dec 10 13:30:57 <HOSTNAME> containerd[837]: github.com/containerd/containerd/v2/core/runtime/v2.(*process).Kill(0xc001986210, {0x55d4b49606e8, 0xc000e86300}, 0x9, 0xc0?)
Dec 10 13:30:57 <HOSTNAME> containerd[837]: /go/src/github.com/containerd/containerd/core/runtime/v2/process.go:41 +0xc7
Dec 10 13:30:57 <HOSTNAME> containerd[837]: github.com/containerd/containerd/v2/plugins/services/tasks.(*local).Kill(0x55d4b4960720?, {0x55d4b49606e8, 0xc000e86300}, 0xc00191a1e0, {0xc000f70828?, 0x3?, 0x500708?})
Dec 10 13:30:57 <HOSTNAME> containerd[837]: /go/src/github.com/containerd/containerd/plugins/services/tasks/local.go:443 +0xbf
Dec 10 13:30:57 <HOSTNAME> containerd[837]: github.com/containerd/containerd/v2/client.(*process).Kill(0xc001838bd0, {0x55d4b4960720, 0xc00191a000}, 0x9, {0xc000ea4028, 0x1, 0x0?})
Dec 10 13:30:57 <HOSTNAME> containerd[837]: /go/src/github.com/containerd/containerd/client/process.go:157 +0x394
Dec 10 13:30:57 <HOSTNAME> containerd[837]: github.com/containerd/containerd/v2/client.WithProcessKill({0x55d4b49606e8, 0xc000e86240}, {0x55d4b496b950, 0xc001838bd0})
Dec 10 13:30:57 <HOSTNAME> containerd[837]: /go/src/github.com/containerd/containerd/client/task_opts.go:168 +0x10e
What version of containerd are you using?
containerd v2.0.0-rc.5 (the issue also occurs on v2.0.0)
Any other relevant information
As far as I understand the issue, it may be a good idea to make the ttrpc.MD type safer by preventing importing packages from accessing its underlying map directly, and by guarding it with an RW mutex.
Here is the fix I made on my side, which resolves the issue: containerd/ttrpc#176
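The idea could be sketched roughly as follows; this is only a minimal illustration of the mutex-plus-encapsulation approach, not the actual containerd/ttrpc#176 patch, and the safeMD name is hypothetical:

```go
package main

import (
	"fmt"
	"sync"
)

// safeMD keeps the metadata map unexported so importing packages
// cannot bypass the lock, and guards every access with an RWMutex.
type safeMD struct {
	mu sync.RWMutex
	m  map[string][]string
}

func newSafeMD() *safeMD {
	return &safeMD{m: make(map[string][]string)}
}

// Set replaces the values for key under the write lock.
func (md *safeMD) Set(key string, values ...string) {
	md.mu.Lock()
	defer md.mu.Unlock()
	md.m[key] = values
}

// Get reads under the read lock, so concurrent readers do not block
// one another.
func (md *safeMD) Get(key string) ([]string, bool) {
	md.mu.RLock()
	defer md.mu.RUnlock()
	v, ok := md.m[key]
	return v, ok
}

func main() {
	md := newSafeMD()
	var wg sync.WaitGroup
	// With a bare map, these concurrent writers would trigger the
	// "concurrent map writes" crash from the traces; with the mutex
	// they are safe.
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			md.Set(fmt.Sprintf("key-%d", i%8), "value")
		}(i)
	}
	wg.Wait()
	v, ok := md.Get("key-0")
	fmt.Println(ok, v) // → true [value]
}
```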
Feel free to let me know what you think
Show configuration if it is related to CRI plugin.
No response
After bisecting v2.0.0-rc.4...v2.0.0-rc.5, the culprit is PR #10186.