Skip to content

fix: bound remaining unbounded Docker context calls (#614)#636

Merged
CybotTM merged 2 commits into
mainfrom
fix/issue-614
May 13, 2026
Merged

fix: bound remaining unbounded Docker context calls (#614)#636
CybotTM merged 2 commits into
mainfrom
fix/issue-614

Conversation

@CybotTM

@CybotTM CybotTM commented May 13, 2026

Copy link
Copy Markdown
Member

Summary

Wraps every remaining context.Background() Docker SDK call surfaced by the audit on PR #611 / issue #608 in context.WithTimeout. Without these, a reachable-but-wedged daemon (e.g. a socket proxy with a stalled upstream) hangs Ofelia at startup, blinds /health and /ready, and stalls the ofelia doctor diagnostic.

Fixes #614.

Fix sites

File:func Original Now bounded by Rationale
web/health.go:checkDocker Ping(ctx) + Info(ctx) with context.Background() dockerHealthCheckTimeout = 5s (per call) Operators expect a non-2xx within one scrape interval
cli/docker_config_handler.go:NewDockerHandler c.dockerProvider.Ping(ctx) post-construction dockerStartupPingTimeout = 10s Daemon boot is one-shot; clear error within seconds
cli/docker_config_handler.go:buildSDKProvider provider.Ping(context.Background()) dockerStartupPingTimeout = 10s Same construction-time wedge
cli/doctor.go:checkDocker provider.Ping(context.Background()) doctorDockerCallTimeout = 5s The diagnostic itself must surface, not hang
cli/doctor.go:checkDockerImages provider.HasImageLocally(ctx, image) (loop) doctorDockerCallTimeout = 5s per call Per-call avoids falsely failing slow daemons with many images

The issue table mentioned only the buildSDKProvider Ping for cli/docker_config_handler.go. The closely-related second sanity Ping at NewDockerHandler (50 lines apart, same construction flow, same root cause) is fixed in the same PR per the task's "same line of code" guidance.

Notes worth flagging for review

  • web/health.go parent context: the issue suggested r.Context(), but checkDocker runs from the periodic ticker (runPeriodicChecks), not the HTTP handler. The handlers read pre-computed cached state. Bounded context.Background() is the correct parent here.
  • SDKDockerProviderConfig.NegotiateTimeout: new optional field plumbs the test-friendly negotiation bound from #611 one layer up, so the construction-time tests can run without waiting on the 30s default. Backwards compatible (zero falls back to defaultNegotiateTimeout).
  • newSDKDockerProvider package-level factory (new): mirrors the existing newDockerHandler test seam so the line-183 regression test can swap in a hanging stub instead of dialling a real socket.

Test plan

Each fix site has a regression test using a hanging mock provider whose context-bound calls block until ctx.Done() (no t.Setenv, no env-var contamination — addresses the parallel-test concern raised in PR #606 review). Each test was confirmed to fail on the pre-fix code (the test deadline fires) and pass on the fixed code.

  • web.TestCheckDocker_BoundedTimeout — fails (30s timeout) → passes (~5s)
  • cli.TestNewDockerHandler_PostConstructSanityPing_BoundedTimeout — fails (30s timeout) → passes (~10s)
  • cli.TestBuildSDKProvider_PostConstructPing_BoundedTimeout — fails (30s timeout) → passes (~10s)
  • cli.TestDoctor_CheckDocker_BoundedTimeout — fails (30s timeout) → passes (~5s)
  • cli.TestDoctor_CheckDockerImages_BoundedTimeout — fails (30s timeout) → passes (~5s)
  • go test -race ./web/ ./cli/ ./core/... clean
  • golangci-lint run --timeout=3m clean

Three Docker SDK call sites still passed context.Background() and would hang
Ofelia when the daemon was reachable but wedged (e.g. socket proxy with a
stalled upstream). The audit on PR #611 / issue #608 surfaced them; this
commit completes the cleanup:

| File:Line | Call | Symptom |
| --- | --- | --- |
| web/health.go:checkDocker | Ping, Info | /health and /ready hang; monitoring blind to wedged daemon |
| cli/docker_config_handler.go (NewDockerHandler post-construct + buildSDKProvider) | Ping x2 | Daemon startup hangs at sanity check |
| cli/doctor.go (checkDocker, checkDockerImages) | Ping, HasImageLocally | The diagnostic command meant to surface this hang itself hangs |

Each call is now wrapped in context.WithTimeout(parentCtx, N) with defer-style
cancel:

- 5s for health-check (operators expect responsive monitoring within one
  scrape interval)
- 10s for startup sanity ping (daemon boot is one-shot, generous)
- 5s per call for doctor (per-image, so a slow daemon with many images is not
  falsely failed by a single overall budget)

Note: web/health.go's checkDocker runs from the periodic ticker, not the
HTTP handler, so the parent context is context.Background() rather than
r.Context() as the issue text suggested. The HTTP handlers read pre-computed
cached state and never call into Docker themselves.

Also fixes a closely-related fourth unbounded Ping at NewDockerHandler's
own post-construction sanity check (same construction-time wedge, two lines
of code apart). Tracked as part of #614 because the audit's interpretation
of the issue covered both Pings.

Plumbs SDKDockerProviderConfig.NegotiateTimeout to forward the test-friendly
negotiation bound from #611 one layer up, so the construction-time tests can
run in seconds instead of waiting on the 30s default. Adds a
newSDKDockerProvider package-level factory so tests can swap in a hanging
stub - mirrors the existing newDockerHandler seam.

Tests: each fix site has a regression test that exercises the wedge with a
hanging mock provider (parallel-safe - no t.Setenv, no env var contamination
per PR #606 review feedback) and asserts the call returns within the bounded
window. Each test was confirmed to fail (test timeout fires) on the pre-fix
code and pass on the fixed code.

Fixes #614

Signed-off-by: Sebastian Mendel <[email protected]>
Copilot AI review requested due to automatic review settings May 13, 2026 16:21
@github-actions github-actions Bot added documentation Improvements or additions to documentation tests labels May 13, 2026
@github-actions

github-actions Bot commented May 13, 2026

Copy link
Copy Markdown

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

github-actions[bot]
github-actions Bot previously approved these changes May 13, 2026

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated approval for maintainer PR

All automated quality gates passed. See SECURITY_CONTROLS.md for compensating controls.

@codecov

codecov Bot commented May 13, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 82.14286% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.68%. Comparing base (d014841) to head (a80e215).

Files with missing lines Patch % Lines
web/health.go 50.00% 3 Missing ⚠️
core/docker_sdk_provider.go 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #636      +/-   ##
==========================================
+ Coverage   87.38%   87.68%   +0.30%     
==========================================
  Files          88       88              
  Lines       10722    10739      +17     
==========================================
+ Hits         9369     9417      +48     
+ Misses       1112     1084      -28     
+ Partials      241      238       -3     
Flag Coverage Δ
integration 87.68% <82.14%> (+0.30%) ⬆️
unittests 84.40% <82.14%> (+0.24%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions

Copy link
Copy Markdown

✅ Mutation Testing Results

Mutation Score: 100.00% (threshold: 60%)

✨ Good job! Mutation score meets the threshold.

What is mutation testing?

Mutation testing measures test quality by introducing small changes (mutations) to the code and checking if tests detect them. A higher score means better test effectiveness.

  • Killed mutants: Tests caught the mutation (good!)
  • Survived mutants: Tests missed the mutation (needs improvement)

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Wraps the three remaining unbounded context.Background() Docker SDK call sites (web health checker, daemon-startup sanity Pings, and ofelia doctor) in context.WithTimeout, so a reachable-but-wedged Docker daemon can no longer hang Ofelia or its /health, /ready, and doctor diagnostics. Completes the audit started in #608/#611.

Changes:

  • Add per-call timeouts: dockerHealthCheckTimeout (5s) in web/health.go, dockerStartupPingTimeout (10s) in cli/docker_config_handler.go, doctorDockerCallTimeout (5s) in cli/doctor.go.
  • Introduce SDKDockerProviderConfig.NegotiateTimeout and a newSDKDockerProvider package-level factory as test seams.
  • Add five regression tests using hanging mock providers that verify each fix site returns within its bound.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.

Show a summary per file
File Description
web/health.go Bound Ping/Info calls in periodic health checker with 5s timeouts
web/health_docker_timeout_test.go Regression test asserting checkDocker returns within bound on wedged provider
cli/docker_config_handler.go Bound startup sanity Pings (10s) in NewDockerHandler and buildSDKProvider; add newSDKDockerProvider test seam
cli/docker_config_handler_timeout_test.go Regression tests for both sanity-Ping bound sites
cli/doctor.go Bound Ping and per-image HasImageLocally calls (5s) in doctor diagnostic
cli/doctor_docker_timeout_test.go Regression tests for checkDocker/checkDockerImages bounds with hanging stub
core/docker_sdk_provider.go Add optional NegotiateTimeout field plumbed into adapter ClientConfig
CHANGELOG.md Document the fix and the new optional config field under [Unreleased] → Fixed

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses issues where unbounded Docker SDK calls could cause Ofelia to hang when interacting with a reachable-but-wedged Docker daemon. The changes introduce bounded contexts with specific timeouts across the startup sanity checks, the 'ofelia doctor' diagnostic tool, and the background health checker. Additionally, the PR adds test coverage for these timeout scenarios and exposes a 'NegotiateTimeout' configuration for the SDK provider to improve testability. The review feedback suggests improving context propagation in 'buildSDKProvider' by utilizing the parent context from 'NewDockerHandler' instead of 'context.Background()'.

Comment thread cli/docker_config_handler.go Outdated
- buildSDKProvider now derives its bounded ping ctx from c.ctx (not
  context.Background) so SIGINT during startup also cancels the sanity
  ping (code review IMPORTANT).
- Bump expectWithin in startup-ping tests from 15s -> 20s for CI
  scheduling headroom (test review nit; spec was 10s timeout, 2x floor).
- TROUBLESHOOTING: new "Docker Daemon Wedged" entry covering the new
  behaviour — what used to be an indefinite hang is now a loud
  context-deadline failure (DX review gap).
- CHANGELOG: clarify that timeouts are unexported constants, document
  the per-call vs overall trade-off in doctor's HasImageLocally, and
  note c.ctx propagation (DX + code review notes).

Signed-off-by: Sebastian Mendel <[email protected]>
@sonarqubecloud

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated approval for maintainer PR

All automated quality gates passed. See SECURITY_CONTROLS.md for compensating controls.

@CybotTM CybotTM added this pull request to the merge queue May 13, 2026
Merged via the queue into main with commit 2f9b8ca May 13, 2026
27 checks passed
@CybotTM CybotTM deleted the fix/issue-614 branch May 13, 2026 18:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bound remaining unbounded context.Background() Docker calls (web/health, cli/docker_config_handler, cli/doctor)

2 participants