
Fatal errors: TTRPC metadata concurrent map operations #11138

Closed
just1not2 opened this issue Dec 11, 2024 · 1 comment · Fixed by #11241

Description

When migrating containerd from v2.0.0-rc.4 to v2.0.0-rc.5 on the nodes of a large Kubernetes cluster, many fatal errors like the one below started to appear and crash containerd:

Dec 10 15:24:57 <HOSTNAME> containerd[32960]: fatal error: concurrent map writes
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: goroutine 164 [running]:
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: github.com/containerd/ttrpc.MD.Set(...)
Dec 10 15:24:57 <HOSTNAME> containerd[32960]:         /go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/metadata.go:48
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: github.com/containerd/otelttrpc.(*metadataSupplier).Set(0x0?, {0x5c66b02ac976, 0xb}, {0xc00087e640, 0x37})
Dec 10 15:24:57 <HOSTNAME> containerd[32960]:         /go/src/github.com/containerd/containerd/vendor/github.com/containerd/otelttrpc/metadata_supplier.go:58 +0x91
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: go.opentelemetry.io/otel/propagation.TraceContext.Inject({}, {0x5c66b0a036e8?, 0xc000a08a20?}, {0x5c66b09fb520, 0xc0008be0e0})
Dec 10 15:24:57 <HOSTNAME> containerd[32960]:         /go/src/github.com/containerd/containerd/vendor/go.opentelemetry.io/otel/propagation/trace_context.go:64 +0x747
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: go.opentelemetry.io/otel/propagation.compositeTextMapPropagator.Inject(...)
Dec 10 15:24:57 <HOSTNAME> containerd[32960]:         /go/src/github.com/containerd/containerd/vendor/go.opentelemetry.io/otel/propagation/propagation.go:106
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: go.opentelemetry.io/otel/internal/global.(*textMapPropagator).Inject(0xc0000ba460?, {0x5c66b0a036e8, 0xc000a08a20}, {0x5c66b09fb520, 0xc0008be0e0})

The concurrent map operations always originate from the containerd/ttrpc dependency and occur in every possible combination: read/write, write/write, and iteration/write.
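For context, ttrpc.MD is a plain map type (map[string][]string), so two goroutines mutating the same MD value without synchronization trip Go's runtime checks. Below is a minimal, standalone sketch of that failure mode (not the actual containerd code path; the key and value are made up):

package main

import (
	"sync"

	"github.com/containerd/ttrpc"
)

func main() {
	// ttrpc.MD is map[string][]string; this shared value stands in for the
	// request metadata that several goroutines end up touching.
	md := ttrpc.MD{}

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100000; j++ {
				// Unsynchronized writes to the same map: the Go runtime may
				// abort with "fatal error: concurrent map writes", as in the
				// traces above.
				md.Set("traceparent", "00-example")
			}
		}()
	}
	wg.Wait()
}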

After bisecting v2.0.0-rc.4...v2.0.0-rc.5, here is the culprit PR: #10186

Steps to reproduce the issue

I could not find a simple reproducer for this issue, because it comes from concurrent operations that happen to run at the same time. On my side, deploying v2.0.0-rc.5 on a ~30-node cluster with ~10 pods per node consistently triggers it (but again, it might depend on many other factors): at any given time, around 2 nodes are failing because of these containerd crashes.

Describe the results you received and expected

Here are a few examples:

Dec 10 15:24:57 <HOSTNAME> containerd[32960]: fatal error: concurrent map writes
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: goroutine 164 [running]:
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: github.com/containerd/ttrpc.MD.Set(...)
Dec 10 15:24:57 <HOSTNAME> containerd[32960]:         /go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/metadata.go:48
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: github.com/containerd/otelttrpc.(*metadataSupplier).Set(0x0?, {0x5c66b02ac976, 0xb}, {0xc00087e640, 0x37})
Dec 10 15:24:57 <HOSTNAME> containerd[32960]:         /go/src/github.com/containerd/containerd/vendor/github.com/containerd/otelttrpc/metadata_supplier.go:58 +0x91
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: go.opentelemetry.io/otel/propagation.TraceContext.Inject({}, {0x5c66b0a036e8?, 0xc000a08a20?}, {0x5c66b09fb520, 0xc0008be0e0})
Dec 10 15:24:57 <HOSTNAME> containerd[32960]:         /go/src/github.com/containerd/containerd/vendor/go.opentelemetry.io/otel/propagation/trace_context.go:64 +0x747
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: go.opentelemetry.io/otel/propagation.compositeTextMapPropagator.Inject(...)
Dec 10 15:24:57 <HOSTNAME> containerd[32960]:         /go/src/github.com/containerd/containerd/vendor/go.opentelemetry.io/otel/propagation/propagation.go:106
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: go.opentelemetry.io/otel/internal/global.(*textMapPropagator).Inject(0xc0000ba460?, {0x5c66b0a036e8, 0xc000a08a20}, {0x5c66b09fb520, 0xc0008be0e0})

Dec 10 15:24:52 <HOSTNAME> containerd[17813]: fatal error: concurrent map iteration and map write
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: goroutine 20431 [running]:
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: github.com/containerd/ttrpc.MD.setRequest(...)
Dec 10 15:24:52 <HOSTNAME> containerd[17813]:         /go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/metadata.go:66
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: github.com/containerd/ttrpc.(*Client).Call(0xc000619290, {0x62c01be746e8, 0xc0013eee70}, {0x62c01b7364d4, 0x17}, {0x62c01b711e19, 0x4}, {0x62c01bcfa7a0?, 0xc0010c61e0?}, {0x62c01bcfa860, ...})
Dec 10 15:24:52 <HOSTNAME> containerd[17813]:         /go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/client.go:163 +0x1a5
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: github.com/containerd/containerd/api/runtime/task/v3.(*ttrpctaskClient).Wait(0xc000122ac0, {0x62c01be746e8, 0xc0013eee70}, 0xc0010c61e0)
Dec 10 15:24:52 <HOSTNAME> containerd[17813]:         /go/src/github.com/containerd/containerd/vendor/github.com/containerd/containerd/api/runtime/task/v3/shim_ttrpc.pb.go:273 +0x92
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: github.com/containerd/containerd/v2/core/runtime/v2.(*process).Wait(0xc0014c6e88, {0x62c01be746e8, 0xc0013eee70})
Dec 10 15:24:52 <HOSTNAME> containerd[17813]:         /go/src/github.com/containerd/containerd/core/runtime/v2/process.go:133 +0xbf
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: github.com/containerd/containerd/v2/plugins/services/tasks.(*local).Wait(0x62c01be74720?, {0x62c01be746e8, 0xc0013eee70}, 0xc000fa54a0, {0xc00132dfb0?, 0x2?, 0x62c01cc60250?})
Dec 10 15:24:52 <HOSTNAME> containerd[17813]:         /go/src/github.com/containerd/containerd/plugins/services/tasks/local.go:633 +0xde
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: github.com/containerd/containerd/v2/client.(*process).Wait.func1()
Dec 10 15:24:52 <HOSTNAME> containerd[17813]:         /go/src/github.com/containerd/containerd/client/process.go:175 +0x27d
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: created by github.com/containerd/containerd/v2/client.(*process).Wait in goroutine 20426
Dec 10 15:24:52 <HOSTNAME> containerd[17813]:         /go/src/github.com/containerd/containerd/client/process.go:168 +0xa5
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: goroutine 1 [chan receive, 17 minutes]:

Dec 10 13:30:57 <HOSTNAME> containerd[837]: fatal error: concurrent map iteration and map write
Dec 10 13:30:57 <HOSTNAME> containerd[837]: goroutine 28835 [running]:
Dec 10 13:30:57 <HOSTNAME> containerd[837]: github.com/containerd/otelttrpc.inject({0x55d4b49606e8, 0xc0013bc180}, {0x55d4b4958100, 0xc0001b5ad0}, 0xc000f14100)
Dec 10 13:30:57 <HOSTNAME> containerd[837]:         /go/src/github.com/containerd/containerd/vendor/github.com/containerd/otelttrpc/metadata_supplier.go:86 +0x2e5
Dec 10 13:30:57 <HOSTNAME> containerd[837]: github.com/containerd/otelttrpc.UnaryClientInterceptor.func1({0x55d4b49606e8, 0xc000e86300}, 0xc000f14100, 0xc00011c640, 0x55d4b47e39e0?, 0xc00187c020)
Dec 10 13:30:57 <HOSTNAME> containerd[837]:         /go/src/github.com/containerd/containerd/vendor/github.com/containerd/otelttrpc/interceptor.go:98 +0x285
Dec 10 13:30:57 <HOSTNAME> containerd[837]: github.com/containerd/ttrpc.(*Client).Call(0xc0008f2000, {0x55d4b49606e8, 0xc000e86300}, {0x55d4b42224d4, 0x17}, {0x55d4b41fdedd, 0x4}, {0x55d4b482e5c0?, 0xc00011c5f0?}, {0x55d4b4775020, ...})
Dec 10 13:30:57 <HOSTNAME> containerd[837]:         /go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/client.go:173 +0x323
Dec 10 13:30:57 <HOSTNAME> containerd[837]: github.com/containerd/containerd/api/runtime/task/v3.(*ttrpctaskClient).Kill(0xc000ea4030, {0x55d4b49606e8, 0xc000e86300}, 0xc00011c5f0)
Dec 10 13:30:57 <HOSTNAME> containerd[837]:         /go/src/github.com/containerd/containerd/vendor/github.com/containerd/containerd/api/runtime/task/v3/shim_ttrpc.pb.go:233 +0x92
Dec 10 13:30:57 <HOSTNAME> containerd[837]: github.com/containerd/containerd/v2/core/runtime/v2.(*process).Kill(0xc001986210, {0x55d4b49606e8, 0xc000e86300}, 0x9, 0xc0?)
Dec 10 13:30:57 <HOSTNAME> containerd[837]:         /go/src/github.com/containerd/containerd/core/runtime/v2/process.go:41 +0xc7
Dec 10 13:30:57 <HOSTNAME> containerd[837]: github.com/containerd/containerd/v2/plugins/services/tasks.(*local).Kill(0x55d4b4960720?, {0x55d4b49606e8, 0xc000e86300}, 0xc00191a1e0, {0xc000f70828?, 0x3?, 0x500708?})
Dec 10 13:30:57 <HOSTNAME> containerd[837]:         /go/src/github.com/containerd/containerd/plugins/services/tasks/local.go:443 +0xbf
Dec 10 13:30:57 <HOSTNAME> containerd[837]: github.com/containerd/containerd/v2/client.(*process).Kill(0xc001838bd0, {0x55d4b4960720, 0xc00191a000}, 0x9, {0xc000ea4028, 0x1, 0x0?})
Dec 10 13:30:57 <HOSTNAME> containerd[837]:         /go/src/github.com/containerd/containerd/client/process.go:157 +0x394
Dec 10 13:30:57 <HOSTNAME> containerd[837]: github.com/containerd/containerd/v2/client.WithProcessKill({0x55d4b49606e8, 0xc000e86240}, {0x55d4b496b950, 0xc001838bd0})
Dec 10 13:30:57 <HOSTNAME> containerd[837]:         /go/src/github.com/containerd/containerd/client/task_opts.go:168 +0x10e

What version of containerd are you using?

containerd v2.0.0-rc.5 (the issue also occurs on v2.0.0)

Any other relevant information

As far as I understand the issue, it may be a good idea to make the ttrpc.MD object safer by preventing importing packages from accessing its contents directly and by guarding it with an RW mutex.
Here is a fix I made on my side that resolves the issue: containerd/ttrpc#176
Feel free to let me know what you think.
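To illustrate the shape of that idea only (a rough sketch, not necessarily what containerd/ttrpc#176 actually implements; safeMD is a hypothetical name), the metadata map could be hidden behind a small struct guarded by a sync.RWMutex so importers can no longer race on the raw map:

package metadata

import "sync"

// safeMD is a hypothetical, mutex-guarded alternative to exposing the plain
// map[string][]string directly. Every access has to go through the lock.
type safeMD struct {
	mu sync.RWMutex
	m  map[string][]string
}

// Get returns the values stored for key, if any.
func (s *safeMD) Get(key string) ([]string, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.m[key]
	return v, ok
}

// Set replaces the values stored for key.
func (s *safeMD) Set(key string, values ...string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.m == nil {
		s.m = make(map[string][]string)
	}
	s.m[key] = values
}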

Show configuration if it is related to CRI plugin.

No response

@dosubot (bot) added the area/runtime label on Dec 11, 2024
@just1not2 changed the title from "TTRPC metadata concurrent map operations" to "Fatal errors: TTRPC metadata concurrent map operations" on Dec 19, 2024
@djdongjin
Member

FYI, this has been fixed in otelttrpc (containerd/otelttrpc#2). Once we have an otelttrpc release (should be soon), we can bump it in containerd.
