fix(core): stop & remove orphaned containers/services on MaxRuntime cancellation#659
Conversation
…ancellation PR #651 wired a wrapper-level deadline into jobWrapper.runWithCtx so inner SDK calls return when the deadline fires, but three downstream paths did not finish the teardown: * RunJob: deferred deleteContainer reused the now-expired runCtx, so the Remove call no-op'd and the container was left running. The cleanup defer now detects an expired parent and issues an explicit best-effort StopContainer (10s) followed by RemoveContainer, both on a fresh context.Background()-derived ctx bounded by jobCleanupTimeout (30s). Stop ordering matters; stop failures log and continue. * RunServiceJob: ticker watcher only checked j.MaxRuntime, never ctx.Done(). The loop now selects on ctx.Done() alongside ticker.C, and Run() issues a fresh-context RemoveService when the per-run ctx has already expired so swarm doesn't retain a phantom task. * ExecJob: documented as a known limitation. The Docker Engine API has no ExecStop primitive; operators relying on a hard ceiling for job-exec MUST enforce it inside their entrypoint (e.g. timeout 30s). Also tightens the misleading localJobEnvResolveTimeout comment in core/localjob.go: LocalJob passes a nil DockerProvider into ResolveJobEnvironment, so the bound is for env-FILE disk reads (a hung NFS mount), NOT a wedged Docker daemon as the previous comment claimed. New tests in core/issue655_test.go pin the deadline cleanup contract for both RunJob and RunServiceJob, including: - TestRunJob_DeadlineFiresStopAndDelete: stop+delete with non-expired ctx - TestRunJob_DeadlineCleanupContinuesOnStopError: stop failure does not block delete attempt - TestRunServiceJob_WatcherHonorsContextCancellation: watcher returns promptly on parent ctx cancellation - TestRunServiceJob_DeadlineRemovesService: RemoveService runs with a fresh non-expired context Refs #655, #651, #638 Signed-off-by: Sebastian Mendel <[email protected]>
Dependency Review✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.Scanned FilesNone |
There was a problem hiding this comment.
Automated approval for maintainer PR
All automated quality gates passed. See SECURITY_CONTROLS.md for compensating controls.
✅ Mutation Testing ResultsMutation Score: 75.00% (threshold: 60%)
What is mutation testing?Mutation testing measures test quality by introducing small changes (mutations) to the code and checking if tests detect them. A higher score means better test effectiveness.
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #659 +/- ##
==========================================
- Coverage 87.87% 87.85% -0.02%
==========================================
Files 88 88
Lines 10972 11004 +32
==========================================
+ Hits 9642 9668 +26
- Misses 1086 1090 +4
- Partials 244 246 +2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Code Review
This pull request addresses issue #655 by ensuring that Docker containers and Swarm services are properly cleaned up when a job's execution deadline is reached. It introduces a best-effort cleanup path using a fresh context to prevent orphaned resources and updates the service watcher to honor context cancellation. Feedback was provided to ensure that errors during service removal are logged as warnings rather than being silently swallowed when a primary execution error occurs.
There was a problem hiding this comment.
Pull request overview
This PR completes the MaxRuntime cancellation story in core/ by ensuring that when the scheduler/job wrapper context deadline fires mid-run, Docker resources (containers and swarm services) are actively torn down instead of being left orphaned, and documents the docker exec limitation.
Changes:
- Add deadline-aware, best-effort stop+remove cleanup for
RunJobon expired per-run contexts (fresh bounded cleanup context). - Update
RunServiceJobto (a) stop its watcher onctx.Done()and (b) remove the swarm service on the deadline path using a fresh bounded cleanup context. - Add regression tests for issue #655 and document the
ExecJoblimitation; clarifyLocalJobenv-resolution timeout comment and update the changelog.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| core/runjob.go | Introduces a bounded cleanup timeout and performs stop+remove on a fresh context when the per-run context has already expired. |
| core/runservice.go | Makes the watcher honor ctx.Done() and ensures service removal uses a fresh cleanup context on deadline/cancellation. |
| core/localjob.go | Clarifies that the env resolution timeout bounds env-file reads (not Docker calls) for job-local. |
| core/issue655_test.go | Adds focused regression tests covering container/service cleanup on wrapper deadline firing. |
| core/execjob.go | Documents the Docker API limitation: no ExecStop, so cancellation can’t kill the in-container process. |
| CHANGELOG.md | Records the user-visible fix and notes the exec limitation and local-job comment correction. |
Address PR #659 review (gemini + Copilot, both flagged): (1) deleteService failure was silently swallowed when watchErr was non-nil — now logged via ctx.Log so orphan services are observable in operator logs. (2) When ctx.Done() fired before the watcher observed a terminal task state, the function logged 'completed with exit code -999' (the ExitCodeSwarmError sentinel), misleading operators into thinking the swarm task itself failed. Now logs the cancellation reason explicitly. Signed-off-by: Sebastian Mendel <[email protected]>
|
There was a problem hiding this comment.
Automated approval for maintainer PR
All automated quality gates passed. See SECURITY_CONTROLS.md for compensating controls.


Summary
Closes #655 (follow-up to #651 / #638).
PR #651 wired a wrapper-level deadline into
jobWrapper.runWithCtxso the inner SDK calls return when the deadline fires. This PR finishes the job: when the deadline actually fires mid-run, the container / swarm service is now torn down instead of being left orphaned.RunJob(core/runjob.go): the deferreddeleteContainerreused the now-expiredrunCtx, so the Remove call no-op'd and the container kept running. The cleanup defer now detects an expired parent context and issues an explicit best-effortStopContainer(10s timeout) followed byRemoveContainer, both on a freshcontext.Background()-derived context bounded byjobCleanupTimeout(30s, mirroringDefaultStopTimeout). Stop ordering matters:Removeon a still-running container races the daemon. Stop failures log + continue so the Remove attempt always runs.RunServiceJob(core/runservice.go): the ticker watcher only checkedj.MaxRuntime, neverctx.Done(). With the wrapper deadline firing, the inner SDK calls bailed out but the watcher kept spinning until the per-job MaxRuntime elapsed (or, when unset, until the default 24h ceiling — orders of magnitude longer than the operator intended). The loop nowselects onctx.Done()alongsideticker.C, andRun()issues a fresh-contextRemoveServiceon the deadline path so the swarm does not retain a phantom task.ExecJob(core/execjob.go): documented as a known limitation in theRungodoc. The Docker Engine API exposes noExecStopprimitive — once an exec session starts, Ofelia cannot kill the in-container process from the outside. Operators relying on a hard ceiling forjob-execMUST enforce it inside their entrypoint (e.g.timeout 30s ...).Also tightens the misleading
localJobEnvResolveTimeoutcomment incore/localjob.go. The previous wording said the bound guarded against a "wedged Docker daemon (env-from)", butLocalJobpasses a nilDockerProviderintoResolveJobEnvironment, so Docker is never contacted on this path. The bound exists strictly to cap env-FILE disk reads (a hung NFS mount, a pathologically large file).Scope note
The user's spec also mentioned widening
[global] max-runtimeinheritance toExecJob/LocalJob/ComposeJob(which do not currently have aMaxRuntimefield at all). That is roughly 200-300 LOC on its own (3 struct fields + 3 GetMaxRuntime methods + 12 config.go propagation sites + tests) and would push this PR past the 500 LOC ceiling. Deferred to a follow-up so this PR stays focused on the actual cleanup bug.Test plan
New tests in
core/issue655_test.go:TestRunJob_DeadlineFiresStopAndDelete— hanging stubWaitContainer, parent ctx with 50ms deadline, asserts bothStopContainerandRemoveContainerwere called AND that each was invoked with a non-expired (fresh) contextTestRunJob_DeadlineCleanupContinuesOnStopError— stop returns error; assert delete is still attemptedTestRunServiceJob_WatcherHonorsContextCancellation— always-running task; assertwatchContainerreturns within 2s of parent ctx cancellation (without this fix, it spins until j.MaxRuntime elapses)TestRunServiceJob_DeadlineRemovesService— assertRemoveServiceis called on the deadline path with a fresh (non-expired) contextVerification:
go test ./... -count=1 -short -race -timeout=300s— all packages passgolangci-lint run ./...— 0 issuesTestRunJobUnit_*/TestRunServiceJobUnit_*suites