🐛 Fix(manager): Prevent goroutine leak on shutdown timeout#3247
🐛 Fix(manager): Prevent goroutine leak on shutdown timeout#3247k8s-ci-robot merged 3 commits intokubernetes-sigs:mainfrom
Conversation
|
Hi @jingyih. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
pkg/manager/runnable_group.go
Outdated
| select { | ||
| case r.errChan <- err: | ||
| // Error sent successfully | ||
| case <-r.ctx.Done(): |
There was a problem hiding this comment.
Shutdown is triggered by cancelling the context and select is not ordered, i.E. if both clauses are valid, its not deterministic which it will use. Shouldn't this be a default instead to ensure we try sending on the errChan first?
There was a problem hiding this comment.
Good catch! We should try sending on the errChan first. I added some complexity here because using a simple "default" could cause errors to be dropped during normal operation if errChan is momentarily blocked.
|
LGTM label has been added. DetailsGit tree hash: f6b67196edd83d5e40548d8041fc169477b5a8d5 |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alvaroaleman, jingyih The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
| // During normal operation, always try to send errors (may block briefly) | ||
| r.errChan <- err |
There was a problem hiding this comment.
Wouldn't it be safer to use a select here with r.ctx.Done() to prevent blocking mentioned in the comment?
|
Thx! |
Fixes #3218
A race condition during manager shutdown can cause a goroutine leak if the graceful shutdown period is exceeded.
When the
gracefulShutdownTimeoutis reached, the manager stops waiting for runnables and terminates the goroutine responsible for draining errors from runnables. If a slow runnable subsequently fails, its attempt to send an error on the shared channel (errChan) will block its goroutine forever, causing a leak.This change prevents the leak by replacing the blocking error send in the runnableGroup with a non-blocking select statement. If the context is canceled, indicating a shutdown is in progress, the error is logged and dropped. This allows the runnable's goroutine to terminate cleanly instead of blocking forever.
Example test failure without this fix: