fix(core): stop & remove orphaned containers/services on MaxRuntime cancellation by CybotTM · Pull Request #659 · netresearch/ofelia

CybotTM · 2026-05-14T06:37:20Z

Summary

Closes #655 (follow-up to #651 / #638).

PR #651 wired a wrapper-level deadline into jobWrapper.runWithCtx so the inner SDK calls return when the deadline fires. This PR finishes the job: when the deadline actually fires mid-run, the container / swarm service is now torn down instead of being left orphaned.

RunJob (core/runjob.go): the deferred deleteContainer reused the now-expired runCtx, so the Remove call no-op'd and the container kept running. The cleanup defer now detects an expired parent context and issues an explicit best-effort StopContainer (10s timeout) followed by RemoveContainer, both on a fresh context.Background()-derived context bounded by jobCleanupTimeout (30s, mirroring DefaultStopTimeout). Stop ordering matters: Remove on a still-running container races the daemon. Stop failures log + continue so the Remove attempt always runs.
RunServiceJob (core/runservice.go): the ticker watcher only checked j.MaxRuntime, never ctx.Done(). With the wrapper deadline firing, the inner SDK calls bailed out but the watcher kept spinning until the per-job MaxRuntime elapsed (or, when unset, until the default 24h ceiling — orders of magnitude longer than the operator intended). The loop now selects on ctx.Done() alongside ticker.C, and Run() issues a fresh-context RemoveService on the deadline path so the swarm does not retain a phantom task.
ExecJob (core/execjob.go): documented as a known limitation in the Run godoc. The Docker Engine API exposes no ExecStop primitive — once an exec session starts, Ofelia cannot kill the in-container process from the outside. Operators relying on a hard ceiling for job-exec MUST enforce it inside their entrypoint (e.g. timeout 30s ...).

Also tightens the misleading localJobEnvResolveTimeout comment in core/localjob.go. The previous wording said the bound guarded against a "wedged Docker daemon (env-from)", but LocalJob passes a nil DockerProvider into ResolveJobEnvironment, so Docker is never contacted on this path. The bound exists strictly to cap env-FILE disk reads (a hung NFS mount, a pathologically large file).

Scope note

The user's spec also mentioned widening [global] max-runtime inheritance to ExecJob / LocalJob / ComposeJob (which do not currently have a MaxRuntime field at all). That is roughly 200-300 LOC on its own (3 struct fields + 3 GetMaxRuntime methods + 12 config.go propagation sites + tests) and would push this PR past the 500 LOC ceiling. Deferred to a follow-up so this PR stays focused on the actual cleanup bug.

Test plan

New tests in core/issue655_test.go:

TestRunJob_DeadlineFiresStopAndDelete — hanging stub WaitContainer, parent ctx with 50ms deadline, asserts both StopContainer and RemoveContainer were called AND that each was invoked with a non-expired (fresh) context
TestRunJob_DeadlineCleanupContinuesOnStopError — stop returns error; assert delete is still attempted
TestRunServiceJob_WatcherHonorsContextCancellation — always-running task; assert watchContainer returns within 2s of parent ctx cancellation (without this fix, it spins until j.MaxRuntime elapses)
TestRunServiceJob_DeadlineRemovesService — assert RemoveService is called on the deadline path with a fresh (non-expired) context

Verification:

go test ./... -count=1 -short -race -timeout=300s — all packages pass
golangci-lint run ./... — 0 issues
No regressions in existing TestRunJobUnit_* / TestRunServiceJobUnit_* suites

…ancellation PR #651 wired a wrapper-level deadline into jobWrapper.runWithCtx so inner SDK calls return when the deadline fires, but three downstream paths did not finish the teardown: * RunJob: deferred deleteContainer reused the now-expired runCtx, so the Remove call no-op'd and the container was left running. The cleanup defer now detects an expired parent and issues an explicit best-effort StopContainer (10s) followed by RemoveContainer, both on a fresh context.Background()-derived ctx bounded by jobCleanupTimeout (30s). Stop ordering matters; stop failures log and continue. * RunServiceJob: ticker watcher only checked j.MaxRuntime, never ctx.Done(). The loop now selects on ctx.Done() alongside ticker.C, and Run() issues a fresh-context RemoveService when the per-run ctx has already expired so swarm doesn't retain a phantom task. * ExecJob: documented as a known limitation. The Docker Engine API has no ExecStop primitive; operators relying on a hard ceiling for job-exec MUST enforce it inside their entrypoint (e.g. timeout 30s). Also tightens the misleading localJobEnvResolveTimeout comment in core/localjob.go: LocalJob passes a nil DockerProvider into ResolveJobEnvironment, so the bound is for env-FILE disk reads (a hung NFS mount), NOT a wedged Docker daemon as the previous comment claimed. New tests in core/issue655_test.go pin the deadline cleanup contract for both RunJob and RunServiceJob, including: - TestRunJob_DeadlineFiresStopAndDelete: stop+delete with non-expired ctx - TestRunJob_DeadlineCleanupContinuesOnStopError: stop failure does not block delete attempt - TestRunServiceJob_WatcherHonorsContextCancellation: watcher returns promptly on parent ctx cancellation - TestRunServiceJob_DeadlineRemovesService: RemoveService runs with a fresh non-expired context Refs #655, #651, #638 Signed-off-by: Sebastian Mendel <[email protected]>

github-actions · 2026-05-14T06:37:36Z

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

github-actions

Automated approval for maintainer PR

All automated quality gates passed. See SECURITY_CONTROLS.md for compensating controls.

github-actions · 2026-05-14T06:39:01Z

✅ Mutation Testing Results

Mutation Score: 75.00% (threshold: 60%)

✨ Good job! Mutation score meets the threshold.

What is mutation testing?

Mutation testing measures test quality by introducing small changes (mutations) to the code and checking if tests detect them. A higher score means better test effectiveness.

Killed mutants: Tests caught the mutation (good!)
Survived mutants: Tests missed the mutation (needs improvement)

codecov · 2026-05-14T06:39:23Z

Codecov Report

❌ Patch coverage is 76.47059% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.85%. Comparing base (86e8cd8) to head (9eb5aa5).
⚠️ Report is 3 commits behind head on main.

Files with missing lines	Patch %	Lines
core/runjob.go	73.33%	2 Missing and 2 partials ⚠️
core/runservice.go	78.94%	3 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #659      +/-   ##
==========================================
- Coverage   87.87%   87.85%   -0.02%     
==========================================
  Files          88       88              
  Lines       10972    11004      +32     
==========================================
+ Hits         9642     9668      +26     
- Misses       1086     1090       +4     
- Partials      244      246       +2

Flag	Coverage Δ
integration	`87.84% <76.47%> (-0.04%)`	⬇️
unittests	`84.95% <76.47%> (-0.02%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

gemini-code-assist

Code Review

This pull request addresses issue #655 by ensuring that Docker containers and Swarm services are properly cleaned up when a job's execution deadline is reached. It introduces a best-effort cleanup path using a fresh context to prevent orphaned resources and updates the service watcher to honor context cancellation. Feedback was provided to ensure that errors during service removal are logged as warnings rather than being silently swallowed when a primary execution error occurs.

Copilot

Pull request overview

This PR completes the MaxRuntime cancellation story in core/ by ensuring that when the scheduler/job wrapper context deadline fires mid-run, Docker resources (containers and swarm services) are actively torn down instead of being left orphaned, and documents the docker exec limitation.

Changes:

Add deadline-aware, best-effort stop+remove cleanup for RunJob on expired per-run contexts (fresh bounded cleanup context).
Update RunServiceJob to (a) stop its watcher on ctx.Done() and (b) remove the swarm service on the deadline path using a fresh bounded cleanup context.
Add regression tests for issue #655 and document the ExecJob limitation; clarify LocalJob env-resolution timeout comment and update the changelog.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
core/runjob.go	Introduces a bounded cleanup timeout and performs stop+remove on a fresh context when the per-run context has already expired.
core/runservice.go	Makes the watcher honor `ctx.Done()` and ensures service removal uses a fresh cleanup context on deadline/cancellation.
core/localjob.go	Clarifies that the env resolution timeout bounds env-file reads (not Docker calls) for `job-local`.
core/issue655_test.go	Adds focused regression tests covering container/service cleanup on wrapper deadline firing.
core/execjob.go	Documents the Docker API limitation: no `ExecStop`, so cancellation can’t kill the in-container process.
CHANGELOG.md	Records the user-visible fix and notes the exec limitation and local-job comment correction.

Address PR #659 review (gemini + Copilot, both flagged): (1) deleteService failure was silently swallowed when watchErr was non-nil — now logged via ctx.Log so orphan services are observable in operator logs. (2) When ctx.Done() fired before the watcher observed a terminal task state, the function logged 'completed with exit code -999' (the ExitCodeSwarmError sentinel), misleading operators into thinking the swarm task itself failed. Now logs the cancellation reason explicitly. Signed-off-by: Sebastian Mendel <[email protected]>

sonarqubecloud · 2026-05-14T07:00:35Z

Quality Gate failed

Failed conditions
10.8% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

github-actions

Automated approval for maintainer PR

All automated quality gates passed. See SECURITY_CONTROLS.md for compensating controls.

Copilot AI review requested due to automatic review settings May 14, 2026 06:37

github-actions Bot added documentation Improvements or additions to documentation tests labels May 14, 2026

github-actions Bot previously approved these changes May 14, 2026

View reviewed changes

Copilot started reviewing on behalf of CybotTM May 14, 2026 06:38 View session

gemini-code-assist Bot reviewed May 14, 2026

View reviewed changes

Comment thread core/runservice.go Outdated

Copilot AI reviewed May 14, 2026

View reviewed changes

Comment thread core/runservice.go Outdated

Comment thread core/runservice.go

CybotTM dismissed github-actions[bot]’s stale review via 9eb5aa5 May 14, 2026 07:00

github-actions Bot approved these changes May 14, 2026

View reviewed changes

CybotTM added this pull request to the merge queue May 14, 2026

Merged via the queue into main with commit 518b5e4 May 14, 2026
25 of 27 checks passed

CybotTM deleted the fix/issue-655 branch May 14, 2026 11:13

CybotTM mentioned this pull request May 14, 2026

chore: prepare v0.25.0 release #663

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(core): stop & remove orphaned containers/services on MaxRuntime cancellation#659

fix(core): stop & remove orphaned containers/services on MaxRuntime cancellation#659
CybotTM merged 2 commits into
mainfrom
fix/issue-655

CybotTM commented May 14, 2026

Uh oh!

github-actions Bot commented May 14, 2026 •

edited

Loading

Uh oh!

github-actions Bot left a comment

Uh oh!

github-actions Bot commented May 14, 2026 •

edited

Loading

Uh oh!

codecov Bot commented May 14, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

sonarqubecloud Bot commented May 14, 2026

Uh oh!

github-actions Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

CybotTM commented May 14, 2026

Summary

Scope note

Test plan

Uh oh!

github-actions Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Dependency Review

Scanned Files

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Mutation Testing Results

Uh oh!

codecov Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

sonarqubecloud Bot commented May 14, 2026

Quality Gate failed

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

github-actions Bot commented May 14, 2026 •

edited

Loading

github-actions Bot commented May 14, 2026 •

edited

Loading

codecov Bot commented May 14, 2026 •

edited

Loading