fix: bound remaining unbounded Docker context calls (#614)#636
Conversation
Three Docker SDK call sites still passed context.Background() and would hang Ofelia when the daemon was reachable but wedged (e.g. socket proxy with a stalled upstream). The audit on PR #611 / issue #608 surfaced them; this commit completes the cleanup: | File:Line | Call | Symptom | | --- | --- | --- | | web/health.go:checkDocker | Ping, Info | /health and /ready hang; monitoring blind to wedged daemon | | cli/docker_config_handler.go (NewDockerHandler post-construct + buildSDKProvider) | Ping x2 | Daemon startup hangs at sanity check | | cli/doctor.go (checkDocker, checkDockerImages) | Ping, HasImageLocally | The diagnostic command meant to surface this hang itself hangs | Each call is now wrapped in context.WithTimeout(parentCtx, N) with defer-style cancel: - 5s for health-check (operators expect responsive monitoring within one scrape interval) - 10s for startup sanity ping (daemon boot is one-shot, generous) - 5s per call for doctor (per-image, so a slow daemon with many images is not falsely failed by a single overall budget) Note: web/health.go's checkDocker runs from the periodic ticker, not the HTTP handler, so the parent context is context.Background() rather than r.Context() as the issue text suggested. The HTTP handlers read pre-computed cached state and never call into Docker themselves. Also fixes a closely-related fourth unbounded Ping at NewDockerHandler's own post-construction sanity check (same construction-time wedge, two lines of code apart). Tracked as part of #614 because the audit's interpretation of the issue covered both Pings. Plumbs SDKDockerProviderConfig.NegotiateTimeout to forward the test-friendly negotiation bound from #611 one layer up, so the construction-time tests can run in seconds instead of waiting on the 30s default. Adds a newSDKDockerProvider package-level factory so tests can swap in a hanging stub - mirrors the existing newDockerHandler seam. Tests: each fix site has a regression test that exercises the wedge with a hanging mock provider (parallel-safe - no t.Setenv, no env var contamination per PR #606 review feedback) and asserts the call returns within the bounded window. Each test was confirmed to fail (test timeout fires) on the pre-fix code and pass on the fixed code. Fixes #614 Signed-off-by: Sebastian Mendel <[email protected]>
Dependency Review✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.Scanned FilesNone |
There was a problem hiding this comment.
Automated approval for maintainer PR
All automated quality gates passed. See SECURITY_CONTROLS.md for compensating controls.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #636 +/- ##
==========================================
+ Coverage 87.38% 87.68% +0.30%
==========================================
Files 88 88
Lines 10722 10739 +17
==========================================
+ Hits 9369 9417 +48
+ Misses 1112 1084 -28
+ Partials 241 238 -3
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
✅ Mutation Testing ResultsMutation Score: 100.00% (threshold: 60%)
What is mutation testing?Mutation testing measures test quality by introducing small changes (mutations) to the code and checking if tests detect them. A higher score means better test effectiveness.
|
There was a problem hiding this comment.
Pull request overview
Wraps the three remaining unbounded context.Background() Docker SDK call sites (web health checker, daemon-startup sanity Pings, and ofelia doctor) in context.WithTimeout, so a reachable-but-wedged Docker daemon can no longer hang Ofelia or its /health, /ready, and doctor diagnostics. Completes the audit started in #608/#611.
Changes:
- Add per-call timeouts:
dockerHealthCheckTimeout(5s) inweb/health.go,dockerStartupPingTimeout(10s) incli/docker_config_handler.go,doctorDockerCallTimeout(5s) incli/doctor.go. - Introduce
SDKDockerProviderConfig.NegotiateTimeoutand anewSDKDockerProviderpackage-level factory as test seams. - Add five regression tests using hanging mock providers that verify each fix site returns within its bound.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| web/health.go | Bound Ping/Info calls in periodic health checker with 5s timeouts |
| web/health_docker_timeout_test.go | Regression test asserting checkDocker returns within bound on wedged provider |
| cli/docker_config_handler.go | Bound startup sanity Pings (10s) in NewDockerHandler and buildSDKProvider; add newSDKDockerProvider test seam |
| cli/docker_config_handler_timeout_test.go | Regression tests for both sanity-Ping bound sites |
| cli/doctor.go | Bound Ping and per-image HasImageLocally calls (5s) in doctor diagnostic |
| cli/doctor_docker_timeout_test.go | Regression tests for checkDocker/checkDockerImages bounds with hanging stub |
| core/docker_sdk_provider.go | Add optional NegotiateTimeout field plumbed into adapter ClientConfig |
| CHANGELOG.md | Document the fix and the new optional config field under [Unreleased] → Fixed |
There was a problem hiding this comment.
Code Review
This pull request addresses issues where unbounded Docker SDK calls could cause Ofelia to hang when interacting with a reachable-but-wedged Docker daemon. The changes introduce bounded contexts with specific timeouts across the startup sanity checks, the 'ofelia doctor' diagnostic tool, and the background health checker. Additionally, the PR adds test coverage for these timeout scenarios and exposes a 'NegotiateTimeout' configuration for the SDK provider to improve testability. The review feedback suggests improving context propagation in 'buildSDKProvider' by utilizing the parent context from 'NewDockerHandler' instead of 'context.Background()'.
- buildSDKProvider now derives its bounded ping ctx from c.ctx (not context.Background) so SIGINT during startup also cancels the sanity ping (code review IMPORTANT). - Bump expectWithin in startup-ping tests from 15s -> 20s for CI scheduling headroom (test review nit; spec was 10s timeout, 2x floor). - TROUBLESHOOTING: new "Docker Daemon Wedged" entry covering the new behaviour — what used to be an indefinite hang is now a loud context-deadline failure (DX review gap). - CHANGELOG: clarify that timeouts are unexported constants, document the per-call vs overall trade-off in doctor's HasImageLocally, and note c.ctx propagation (DX + code review notes). Signed-off-by: Sebastian Mendel <[email protected]>
|
There was a problem hiding this comment.
Automated approval for maintainer PR
All automated quality gates passed. See SECURITY_CONTROLS.md for compensating controls.



Summary
Wraps every remaining
context.Background()Docker SDK call surfaced by the audit on PR #611 / issue #608 incontext.WithTimeout. Without these, a reachable-but-wedged daemon (e.g. a socket proxy with a stalled upstream) hangs Ofelia at startup, blinds/healthand/ready, and stalls theofelia doctordiagnostic.Fixes #614.
Fix sites
web/health.go:checkDockerPing(ctx)+Info(ctx)withcontext.Background()dockerHealthCheckTimeout = 5s(per call)cli/docker_config_handler.go:NewDockerHandlerc.dockerProvider.Ping(ctx)post-constructiondockerStartupPingTimeout = 10scli/docker_config_handler.go:buildSDKProviderprovider.Ping(context.Background())dockerStartupPingTimeout = 10scli/doctor.go:checkDockerprovider.Ping(context.Background())doctorDockerCallTimeout = 5scli/doctor.go:checkDockerImagesprovider.HasImageLocally(ctx, image)(loop)doctorDockerCallTimeout = 5sper callThe issue table mentioned only the
buildSDKProviderPing forcli/docker_config_handler.go. The closely-related second sanity Ping atNewDockerHandler(50 lines apart, same construction flow, same root cause) is fixed in the same PR per the task's "same line of code" guidance.Notes worth flagging for review
web/health.goparent context: the issue suggestedr.Context(), butcheckDockerruns from the periodic ticker (runPeriodicChecks), not the HTTP handler. The handlers read pre-computed cached state. Boundedcontext.Background()is the correct parent here.SDKDockerProviderConfig.NegotiateTimeout: new optional field plumbs the test-friendly negotiation bound from #611 one layer up, so the construction-time tests can run without waiting on the 30s default. Backwards compatible (zero falls back todefaultNegotiateTimeout).newSDKDockerProviderpackage-level factory (new): mirrors the existingnewDockerHandlertest seam so the line-183 regression test can swap in a hanging stub instead of dialling a real socket.Test plan
Each fix site has a regression test using a hanging mock provider whose context-bound calls block until
ctx.Done()(not.Setenv, no env-var contamination — addresses the parallel-test concern raised in PR #606 review). Each test was confirmed to fail on the pre-fix code (the test deadline fires) and pass on the fixed code.web.TestCheckDocker_BoundedTimeout— fails (30s timeout) → passes (~5s)cli.TestNewDockerHandler_PostConstructSanityPing_BoundedTimeout— fails (30s timeout) → passes (~10s)cli.TestBuildSDKProvider_PostConstructPing_BoundedTimeout— fails (30s timeout) → passes (~10s)cli.TestDoctor_CheckDocker_BoundedTimeout— fails (30s timeout) → passes (~5s)cli.TestDoctor_CheckDockerImages_BoundedTimeout— fails (30s timeout) → passes (~5s)go test -race ./web/ ./cli/ ./core/...cleangolangci-lint run --timeout=3mclean