Skip to content

fix(core): stop & remove orphaned containers/services on MaxRuntime cancellation#659

Merged
CybotTM merged 2 commits into
mainfrom
fix/issue-655
May 14, 2026
Merged

fix(core): stop & remove orphaned containers/services on MaxRuntime cancellation#659
CybotTM merged 2 commits into
mainfrom
fix/issue-655

Conversation

@CybotTM

@CybotTM CybotTM commented May 14, 2026

Copy link
Copy Markdown
Member

Summary

Closes #655 (follow-up to #651 / #638).

PR #651 wired a wrapper-level deadline into jobWrapper.runWithCtx so the inner SDK calls return when the deadline fires. This PR finishes the job: when the deadline actually fires mid-run, the container / swarm service is now torn down instead of being left orphaned.

  • RunJob (core/runjob.go): the deferred deleteContainer reused the now-expired runCtx, so the Remove call no-op'd and the container kept running. The cleanup defer now detects an expired parent context and issues an explicit best-effort StopContainer (10s timeout) followed by RemoveContainer, both on a fresh context.Background()-derived context bounded by jobCleanupTimeout (30s, mirroring DefaultStopTimeout). Stop ordering matters: Remove on a still-running container races the daemon. Stop failures log + continue so the Remove attempt always runs.
  • RunServiceJob (core/runservice.go): the ticker watcher only checked j.MaxRuntime, never ctx.Done(). With the wrapper deadline firing, the inner SDK calls bailed out but the watcher kept spinning until the per-job MaxRuntime elapsed (or, when unset, until the default 24h ceiling — orders of magnitude longer than the operator intended). The loop now selects on ctx.Done() alongside ticker.C, and Run() issues a fresh-context RemoveService on the deadline path so the swarm does not retain a phantom task.
  • ExecJob (core/execjob.go): documented as a known limitation in the Run godoc. The Docker Engine API exposes no ExecStop primitive — once an exec session starts, Ofelia cannot kill the in-container process from the outside. Operators relying on a hard ceiling for job-exec MUST enforce it inside their entrypoint (e.g. timeout 30s ...).

Also tightens the misleading localJobEnvResolveTimeout comment in core/localjob.go. The previous wording said the bound guarded against a "wedged Docker daemon (env-from)", but LocalJob passes a nil DockerProvider into ResolveJobEnvironment, so Docker is never contacted on this path. The bound exists strictly to cap env-FILE disk reads (a hung NFS mount, a pathologically large file).

Scope note

The user's spec also mentioned widening [global] max-runtime inheritance to ExecJob / LocalJob / ComposeJob (which do not currently have a MaxRuntime field at all). That is roughly 200-300 LOC on its own (3 struct fields + 3 GetMaxRuntime methods + 12 config.go propagation sites + tests) and would push this PR past the 500 LOC ceiling. Deferred to a follow-up so this PR stays focused on the actual cleanup bug.

Test plan

New tests in core/issue655_test.go:

  • TestRunJob_DeadlineFiresStopAndDelete — hanging stub WaitContainer, parent ctx with 50ms deadline, asserts both StopContainer and RemoveContainer were called AND that each was invoked with a non-expired (fresh) context
  • TestRunJob_DeadlineCleanupContinuesOnStopError — stop returns error; assert delete is still attempted
  • TestRunServiceJob_WatcherHonorsContextCancellation — always-running task; assert watchContainer returns within 2s of parent ctx cancellation (without this fix, it spins until j.MaxRuntime elapses)
  • TestRunServiceJob_DeadlineRemovesService — assert RemoveService is called on the deadline path with a fresh (non-expired) context

Verification:

  • go test ./... -count=1 -short -race -timeout=300s — all packages pass
  • golangci-lint run ./... — 0 issues
  • No regressions in existing TestRunJobUnit_* / TestRunServiceJobUnit_* suites

…ancellation

PR #651 wired a wrapper-level deadline into jobWrapper.runWithCtx so
inner SDK calls return when the deadline fires, but three downstream
paths did not finish the teardown:

* RunJob: deferred deleteContainer reused the now-expired runCtx, so
  the Remove call no-op'd and the container was left running. The
  cleanup defer now detects an expired parent and issues an explicit
  best-effort StopContainer (10s) followed by RemoveContainer, both on
  a fresh context.Background()-derived ctx bounded by jobCleanupTimeout
  (30s). Stop ordering matters; stop failures log and continue.
* RunServiceJob: ticker watcher only checked j.MaxRuntime, never
  ctx.Done(). The loop now selects on ctx.Done() alongside ticker.C,
  and Run() issues a fresh-context RemoveService when the per-run ctx
  has already expired so swarm doesn't retain a phantom task.
* ExecJob: documented as a known limitation. The Docker Engine API has
  no ExecStop primitive; operators relying on a hard ceiling for
  job-exec MUST enforce it inside their entrypoint (e.g. timeout 30s).

Also tightens the misleading localJobEnvResolveTimeout comment in
core/localjob.go: LocalJob passes a nil DockerProvider into
ResolveJobEnvironment, so the bound is for env-FILE disk reads (a hung
NFS mount), NOT a wedged Docker daemon as the previous comment claimed.

New tests in core/issue655_test.go pin the deadline cleanup contract
for both RunJob and RunServiceJob, including:
- TestRunJob_DeadlineFiresStopAndDelete: stop+delete with non-expired ctx
- TestRunJob_DeadlineCleanupContinuesOnStopError: stop failure does not
  block delete attempt
- TestRunServiceJob_WatcherHonorsContextCancellation: watcher returns
  promptly on parent ctx cancellation
- TestRunServiceJob_DeadlineRemovesService: RemoveService runs with a
  fresh non-expired context

Refs #655, #651, #638

Signed-off-by: Sebastian Mendel <[email protected]>
Copilot AI review requested due to automatic review settings May 14, 2026 06:37
@github-actions

github-actions Bot commented May 14, 2026

Copy link
Copy Markdown

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

@github-actions github-actions Bot added documentation Improvements or additions to documentation tests labels May 14, 2026
github-actions[bot]
github-actions Bot previously approved these changes May 14, 2026

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated approval for maintainer PR

All automated quality gates passed. See SECURITY_CONTROLS.md for compensating controls.

@github-actions

github-actions Bot commented May 14, 2026

Copy link
Copy Markdown

✅ Mutation Testing Results

Mutation Score: 75.00% (threshold: 60%)

✨ Good job! Mutation score meets the threshold.

What is mutation testing?

Mutation testing measures test quality by introducing small changes (mutations) to the code and checking if tests detect them. A higher score means better test effectiveness.

  • Killed mutants: Tests caught the mutation (good!)
  • Survived mutants: Tests missed the mutation (needs improvement)

@codecov

codecov Bot commented May 14, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 76.47059% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.85%. Comparing base (86e8cd8) to head (9eb5aa5).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
core/runjob.go 73.33% 2 Missing and 2 partials ⚠️
core/runservice.go 78.94% 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #659      +/-   ##
==========================================
- Coverage   87.87%   87.85%   -0.02%     
==========================================
  Files          88       88              
  Lines       10972    11004      +32     
==========================================
+ Hits         9642     9668      +26     
- Misses       1086     1090       +4     
- Partials      244      246       +2     
Flag Coverage Δ
integration 87.84% <76.47%> (-0.04%) ⬇️
unittests 84.95% <76.47%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses issue #655 by ensuring that Docker containers and Swarm services are properly cleaned up when a job's execution deadline is reached. It introduces a best-effort cleanup path using a fresh context to prevent orphaned resources and updates the service watcher to honor context cancellation. Feedback was provided to ensure that errors during service removal are logged as warnings rather than being silently swallowed when a primary execution error occurs.

Comment thread core/runservice.go Outdated

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR completes the MaxRuntime cancellation story in core/ by ensuring that when the scheduler/job wrapper context deadline fires mid-run, Docker resources (containers and swarm services) are actively torn down instead of being left orphaned, and documents the docker exec limitation.

Changes:

  • Add deadline-aware, best-effort stop+remove cleanup for RunJob on expired per-run contexts (fresh bounded cleanup context).
  • Update RunServiceJob to (a) stop its watcher on ctx.Done() and (b) remove the swarm service on the deadline path using a fresh bounded cleanup context.
  • Add regression tests for issue #655 and document the ExecJob limitation; clarify LocalJob env-resolution timeout comment and update the changelog.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
core/runjob.go Introduces a bounded cleanup timeout and performs stop+remove on a fresh context when the per-run context has already expired.
core/runservice.go Makes the watcher honor ctx.Done() and ensures service removal uses a fresh cleanup context on deadline/cancellation.
core/localjob.go Clarifies that the env resolution timeout bounds env-file reads (not Docker calls) for job-local.
core/issue655_test.go Adds focused regression tests covering container/service cleanup on wrapper deadline firing.
core/execjob.go Documents the Docker API limitation: no ExecStop, so cancellation can’t kill the in-container process.
CHANGELOG.md Records the user-visible fix and notes the exec limitation and local-job comment correction.

Comment thread core/runservice.go Outdated
Comment thread core/runservice.go
Address PR #659 review (gemini + Copilot, both flagged): (1) deleteService failure was silently swallowed when watchErr was non-nil — now logged via ctx.Log so orphan services are observable in operator logs. (2) When ctx.Done() fired before the watcher observed a terminal task state, the function logged 'completed with exit code -999' (the ExitCodeSwarmError sentinel), misleading operators into thinking the swarm task itself failed. Now logs the cancellation reason explicitly.

Signed-off-by: Sebastian Mendel <[email protected]>
@sonarqubecloud

Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
10.8% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated approval for maintainer PR

All automated quality gates passed. See SECURITY_CONTROLS.md for compensating controls.

@CybotTM CybotTM added this pull request to the merge queue May 14, 2026
Merged via the queue into main with commit 518b5e4 May 14, 2026
25 of 27 checks passed
@CybotTM CybotTM deleted the fix/issue-655 branch May 14, 2026 11:13
@CybotTM CybotTM mentioned this pull request May 14, 2026
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

core: scheduler MaxRuntime cancellation must actually stop the running container/exec/service

2 participants