fix(docker): bound NegotiateAPIVersion with configurable timeout (#608)#611
Conversation
`NewClientWithConfig` invoked `sdk.NegotiateAPIVersion(context.Background())` with an unbounded context. When the Docker daemon is reachable but unresponsive (e.g. a socket proxy whose upstream is wedged, or a remote daemon under heavy load), this call blocks forever and Ofelia hangs at startup with no diagnostic output. This was masked while the dialer hard-coded `/var/run/docker.sock`: the code path only ran against a local socket where the daemon either responded fast or failed the connection quickly. After #606, TCP setups (including socket proxies) actually reach the daemon, which makes the unbounded-context path much more reachable in practice. Add `NegotiateTimeout` to `ClientConfig` (default 30s) and wrap the negotiation in `context.WithTimeout`. NegotiateAPIVersion swallows ping errors silently, so the timeout only bounds the failure case; the successful path is unchanged. A new internal test starts a loopback `net.Listener` that accepts but never reads/writes, points `DOCKER_HOST` at it via `t.Setenv`, and asserts `NewClientWithConfig` returns within 5x the configured 200ms timeout (with a 5s hard safety net). Verified with `go test -race -count=3` and `golangci-lint run`. Fixes #608 Signed-off-by: Sebastian Mendel <[email protected]>
Dependency Review✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.Scanned FilesNone |
There was a problem hiding this comment.
Automated approval for maintainer PR
All automated quality gates passed. See SECURITY_CONTROLS.md for compensating controls.
✅ Mutation Testing ResultsMutation Score: 100.00% (threshold: 60%)
What is mutation testing?Mutation testing measures test quality by introducing small changes (mutations) to the code and checking if tests detect them. A higher score means better test effectiveness.
|
There was a problem hiding this comment.
Code Review
This pull request introduces a NegotiateTimeout setting to the Docker client configuration to prevent the application from hanging indefinitely during initial API version negotiation with an unresponsive daemon. It includes updates to the ClientConfig struct, the DefaultConfig function, and the NewClientWithConfig initialization logic, along with comprehensive regression tests. A review comment suggests improving maintainability by using DefaultConfig().NegotiateTimeout as the fallback value in NewClientWithConfig instead of hardcoding the 30-second duration again.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #611 +/- ##
==========================================
- Coverage 87.28% 87.23% -0.06%
==========================================
Files 88 88
Lines 10631 10643 +12
==========================================
+ Hits 9279 9284 +5
- Misses 1112 1118 +6
- Partials 240 241 +1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR bounds Docker API version negotiation during client startup so Ofelia no longer hangs indefinitely when a reachable Docker daemon or socket proxy stops responding.
Changes:
- Adds
NegotiateTimeoutto Docker client configuration with a 30s default. - Wraps
NegotiateAPIVersionin a timeout-bound context. - Adds regression coverage for wedged negotiation and default timeout behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
core/adapters/docker/client.go |
Adds configurable negotiation timeout and applies it during Docker client creation. |
core/adapters/docker/client_negotiate_timeout_test.go |
Adds tests validating bounded negotiation behavior and default timeout sanity. |
- Extract defaultNegotiateTimeout const so DefaultConfig and the <=0 fallback share one value. - Surface deadline-exceeded as a slog warning so operators see *why* startup is slow rather than chasing phantom bugs. NegotiateAPIVersion itself swallows ping errors silently, so ctx.Err() is the only signal. - Bump regression-test timeout 200ms -> 500ms with 10x return-window for CI scheduling jitter; matching headroom on the hard deadline. - Add TestNewClientWithConfig_NegotiateTimeoutFallback to pin the zero/negative -> defaultNegotiateTimeout fallback path. - Tighten TestDefaultConfig_NegotiateTimeout to assert the exact const. - CHANGELOG entry under [Unreleased] -> ### Fixed. Signed-off-by: Sebastian Mendel <[email protected]>
|
There was a problem hiding this comment.
Automated approval for maintainer PR
All automated quality gates passed. See SECURITY_CONTROLS.md for compensating controls.
The actual code fix for #605 cascaded into main via #613 (which added DOCKER_HOST env-driven host resolution to createHTTPClient as a side-effect of wiring TLS material). This commit adds the parts of PR #606 that did not duplicate code already on main: - TROUBLESHOOTING entry "DOCKER_HOST with TCP Socket Proxy (v0.12.0 - v0.24.0)" with the original symptom, root cause, and a working tecnativa/docker-socket-proxy compose snippet. - CHANGELOG attribution under [Unreleased] -> Fixed for #606/#605 alongside the existing #611/#612/#613/#618 entries, plus a Tests section noting the TestHealthStatus race fix that was also in #606. The TestHealthStatus race fix was cherry-picked separately as the preceding commit (it predates the recent batch and is unrelated to the docker adapter). Signed-off-by: Sebastian Mendel <[email protected]>



Summary
Fixes #608.
core/adapters/docker/client.go::NewClientWithConfigcalledsdk.NegotiateAPIVersion(context.Background())with an unbounded context. If the Docker daemon was reachable but unresponsive (e.g. a socket proxy whose upstream is wedged, or a remote daemon under heavy load),NewClientWithConfigblocked forever and Ofelia hung at startup with no diagnostic output.Root cause
NegotiateAPIVersionissues/_pingand waits on the response. Withcontext.Background()there is no timeout. The Docker SDK swallows ping errors silently, so a timeout does not change correctness on successful paths — it only bounds the failure case.Why this matters more now
This was masked while the dialer hard-coded
/var/run/docker.sock: the code path only ran against a local socket where the daemon either responded fast or failed the connection quickly. After #606, TCP setups (including socket proxies) actually reach the daemon, which makes the unbounded-context path far more reachable in practice.Fix
NegotiateTimeout time.DurationtoClientConfig(default30sinDefaultConfig).context.WithTimeout(context.Background(), config.NegotiateTimeout)with adefer cancel().ClientConfigand at the call site.No non-negotiation paths were changed.
Test plan
TestNewClientWithConfig_NegotiateAPIVersionTimeoutstarts a loopbacknet.Listenerthat accepts but never reads/writes (simulating a wedged daemon), pointsDOCKER_HOSTat it viat.Setenv, setsNegotiateTimeout=200ms, and assertsNewClientWithConfigreturns within 5x the configured timeout, with a 5s hard safety net.TestDefaultConfig_NegotiateTimeoutasserts the default is non-zero and bounded.go test -race -count=3 ./core/adapters/docker/ ./core/— all green.golangci-lint run --timeout=3m— 0 issues.Relationship to #606
#606 made TCP / socket-proxy paths actually reach the daemon, which unmasked this latent hang. This PR closes that exposure without touching anything in the dialer / transport path.