Skip to content

fix(docker): bound NegotiateAPIVersion with configurable timeout (#608)#611

Merged
CybotTM merged 2 commits into
mainfrom
fix/issue-608
May 13, 2026
Merged

fix(docker): bound NegotiateAPIVersion with configurable timeout (#608)#611
CybotTM merged 2 commits into
mainfrom
fix/issue-608

Conversation

@CybotTM

@CybotTM CybotTM commented May 13, 2026

Copy link
Copy Markdown
Member

Summary

Fixes #608. core/adapters/docker/client.go::NewClientWithConfig called sdk.NegotiateAPIVersion(context.Background()) with an unbounded context. If the Docker daemon was reachable but unresponsive (e.g. a socket proxy whose upstream is wedged, or a remote daemon under heavy load), NewClientWithConfig blocked forever and Ofelia hung at startup with no diagnostic output.

Root cause

NegotiateAPIVersion issues /_ping and waits on the response. With context.Background() there is no timeout. The Docker SDK swallows ping errors silently, so a timeout does not change correctness on successful paths — it only bounds the failure case.

Why this matters more now

This was masked while the dialer hard-coded /var/run/docker.sock: the code path only ran against a local socket where the daemon either responded fast or failed the connection quickly. After #606, TCP setups (including socket proxies) actually reach the daemon, which makes the unbounded-context path far more reachable in practice.

Fix

  • Add NegotiateTimeout time.Duration to ClientConfig (default 30s in DefaultConfig).
  • Wrap the negotiation in context.WithTimeout(context.Background(), config.NegotiateTimeout) with a defer cancel().
  • Document the field and the rationale inline on ClientConfig and at the call site.

No non-negotiation paths were changed.

Test plan

  • New internal test TestNewClientWithConfig_NegotiateAPIVersionTimeout starts a loopback net.Listener that accepts but never reads/writes (simulating a wedged daemon), points DOCKER_HOST at it via t.Setenv, sets NegotiateTimeout=200ms, and asserts NewClientWithConfig returns within 5x the configured timeout, with a 5s hard safety net.
  • New TestDefaultConfig_NegotiateTimeout asserts the default is non-zero and bounded.
  • go test -race -count=3 ./core/adapters/docker/ ./core/ — all green.
  • golangci-lint run --timeout=3m — 0 issues.
  • TDD: confirmed test failed to compile against current code (field did not exist) before applying the fix; passes in 0.21s after.

Relationship to #606

#606 made TCP / socket-proxy paths actually reach the daemon, which unmasked this latent hang. This PR closes that exposure without touching anything in the dialer / transport path.

`NewClientWithConfig` invoked `sdk.NegotiateAPIVersion(context.Background())`
with an unbounded context. When the Docker daemon is reachable but
unresponsive (e.g. a socket proxy whose upstream is wedged, or a remote
daemon under heavy load), this call blocks forever and Ofelia hangs at
startup with no diagnostic output.

This was masked while the dialer hard-coded `/var/run/docker.sock`:
the code path only ran against a local socket where the daemon either
responded fast or failed the connection quickly. After #606, TCP setups
(including socket proxies) actually reach the daemon, which makes the
unbounded-context path much more reachable in practice.

Add `NegotiateTimeout` to `ClientConfig` (default 30s) and wrap the
negotiation in `context.WithTimeout`. NegotiateAPIVersion swallows ping
errors silently, so the timeout only bounds the failure case; the
successful path is unchanged.

A new internal test starts a loopback `net.Listener` that accepts but
never reads/writes, points `DOCKER_HOST` at it via `t.Setenv`, and
asserts `NewClientWithConfig` returns within 5x the configured 200ms
timeout (with a 5s hard safety net). Verified with `go test -race
-count=3` and `golangci-lint run`.

Fixes #608

Signed-off-by: Sebastian Mendel <[email protected]>
Copilot AI review requested due to automatic review settings May 13, 2026 09:52
@github-actions github-actions Bot added the tests label May 13, 2026
@github-actions

github-actions Bot commented May 13, 2026

Copy link
Copy Markdown

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

github-actions[bot]
github-actions Bot previously approved these changes May 13, 2026

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated approval for maintainer PR

All automated quality gates passed. See SECURITY_CONTROLS.md for compensating controls.

@github-actions

github-actions Bot commented May 13, 2026

Copy link
Copy Markdown

✅ Mutation Testing Results

Mutation Score: 100.00% (threshold: 60%)

✨ Good job! Mutation score meets the threshold.

What is mutation testing?

Mutation testing measures test quality by introducing small changes (mutations) to the code and checking if tests detect them. A higher score means better test effectiveness.

  • Killed mutants: Tests caught the mutation (good!)
  • Survived mutants: Tests missed the mutation (needs improvement)

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a NegotiateTimeout setting to the Docker client configuration to prevent the application from hanging indefinitely during initial API version negotiation with an unresponsive daemon. It includes updates to the ClientConfig struct, the DefaultConfig function, and the NewClientWithConfig initialization logic, along with comprehensive regression tests. A review comment suggests improving maintainability by using DefaultConfig().NegotiateTimeout as the fallback value in NewClientWithConfig instead of hardcoding the 30-second duration again.

Comment thread core/adapters/docker/client.go
@codecov

codecov Bot commented May 13, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.23%. Comparing base (90c1a86) to head (ca891f2).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #611      +/-   ##
==========================================
- Coverage   87.28%   87.23%   -0.06%     
==========================================
  Files          88       88              
  Lines       10631    10643      +12     
==========================================
+ Hits         9279     9284       +5     
- Misses       1112     1118       +6     
- Partials      240      241       +1     
Flag Coverage Δ
integration 87.21% <100.00%> (-0.08%) ⬇️
unittests 83.80% <100.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR bounds Docker API version negotiation during client startup so Ofelia no longer hangs indefinitely when a reachable Docker daemon or socket proxy stops responding.

Changes:

  • Adds NegotiateTimeout to Docker client configuration with a 30s default.
  • Wraps NegotiateAPIVersion in a timeout-bound context.
  • Adds regression coverage for wedged negotiation and default timeout behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
core/adapters/docker/client.go Adds configurable negotiation timeout and applies it during Docker client creation.
core/adapters/docker/client_negotiate_timeout_test.go Adds tests validating bounded negotiation behavior and default timeout sanity.

- Extract defaultNegotiateTimeout const so DefaultConfig and the <=0
  fallback share one value.
- Surface deadline-exceeded as a slog warning so operators see *why*
  startup is slow rather than chasing phantom bugs. NegotiateAPIVersion
  itself swallows ping errors silently, so ctx.Err() is the only signal.
- Bump regression-test timeout 200ms -> 500ms with 10x return-window for
  CI scheduling jitter; matching headroom on the hard deadline.
- Add TestNewClientWithConfig_NegotiateTimeoutFallback to pin the
  zero/negative -> defaultNegotiateTimeout fallback path.
- Tighten TestDefaultConfig_NegotiateTimeout to assert the exact const.
- CHANGELOG entry under [Unreleased] -> ### Fixed.

Signed-off-by: Sebastian Mendel <[email protected]>
@sonarqubecloud

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated approval for maintainer PR

All automated quality gates passed. See SECURITY_CONTROLS.md for compensating controls.

@CybotTM CybotTM added this pull request to the merge queue May 13, 2026
Merged via the queue into main with commit efd2fee May 13, 2026
27 checks passed
@CybotTM CybotTM deleted the fix/issue-608 branch May 13, 2026 11:51
CybotTM added a commit that referenced this pull request May 13, 2026
The actual code fix for #605 cascaded into main via #613 (which added
DOCKER_HOST env-driven host resolution to createHTTPClient as a
side-effect of wiring TLS material). This commit adds the parts of
PR #606 that did not duplicate code already on main:

- TROUBLESHOOTING entry "DOCKER_HOST with TCP Socket Proxy
  (v0.12.0 - v0.24.0)" with the original symptom, root cause, and a
  working tecnativa/docker-socket-proxy compose snippet.
- CHANGELOG attribution under [Unreleased] -> Fixed for #606/#605
  alongside the existing #611/#612/#613/#618 entries, plus a Tests
  section noting the TestHealthStatus race fix that was also in #606.

The TestHealthStatus race fix was cherry-picked separately as the
preceding commit (it predates the recent batch and is unrelated to
the docker adapter).

Signed-off-by: Sebastian Mendel <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Docker SDK adapter: NegotiateAPIVersion uses unbounded context.Background()

2 participants