fix(core): bound remaining unbounded contexts in scheduler and jobs (#638)#651
Conversation
Sibling-hunt during PR #636 (which fixed #614 by bounding the cli/web Docker pings) flagged seven more unbounded context.Background() / context.TODO() sites in core/ that could let job execution hang on a wedged Docker upstream: - core/common.go:NewContext (foundational; every job inherited) - core/scheduler.go:maxConcurrentSkipJob.Run (cron.Job fallback) - core/scheduler.go:jobWrapper.Run (cron.Job fallback) - core/execjob.go:47 (nil ctx.Ctx fallback) - core/runjob.go:107 (nil ctx.Ctx fallback) - core/runservice.go:72 (nil ctx.Ctx fallback) - core/localjob.go:54 (ResolveJobEnvironment unbounded) Three coordinated fixes: 1. Centralize the nil-fallback in (*Context).RunContext() so the four inline "if runCtx == nil { runCtx = context.Background() }" blocks become a single helper call. NewContext now delegates to NewContextWithContext(context.TODO(), ...) to flag the unset-parent case to readers and static analyzers. 2. Apply scheduler-level MaxRuntime enforcement in jobWrapper.runWithCtx via the new boundJobContext helper + MaxRuntimeProvider interface. RunJob and RunServiceJob (which carry MaxRuntime fields) implement the interface and feed their per-job deadline into the bound; all other job types (ExecJob, LocalJob, ComposeJob) inherit the defaultJobMaxRuntime = 24h ceiling. Previously RunJob was the only job type whose internal startAndWait re-wrapped with WithTimeout; ExecJob and RunServiceJob could hang against a slow docker exec/service create with no upper bound. 3. Bound LocalJob's ResolveJobEnvironment call (env-from-container) with a 10s localJobEnvResolveTimeout so a wedged Docker daemon cannot stall local-job startup. Also replaces the two defensive context.Background() fallbacks in the cron.Job (non-context) paths with context.TODO() to make their intent explicit. Tests: - TestContext_RunContext_NilFallback / _PassThrough: helper contract - TestNewContext_NeverNilCtx: factory invariant - TestExecJob_RespectsBoundedParent: executor surfaces ctx cancel - TestBoundJobContext_FromJobMaxRuntime / _GlobalDefault: helper math - TestJobWrapper_BoundsExecJobByDefaultMaxRuntime: end-to-end regression — feeds the wrapper an UNBOUNDED context.TODO() and asserts the inner Docker mock receives a context with a Deadline, proving the wrapper-level bound is applied even for job types that do not implement MaxRuntimeProvider themselves. Caveat documented in CHANGELOG: [global] max-runtime today only inherits into RunJob/ServiceJob at config-load time (cli/config.go:449,471). Operators wanting a sub-day ceiling on ExecJob/LocalJob/ComposeJob still need a per-job max-runtime; widening the inheritance is left for a follow-up. Refs: #614, #636 Fixes: #638 Signed-off-by: Sebastian Mendel <[email protected]>
|
Dependency Review✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.Scanned FilesNone |
There was a problem hiding this comment.
Automated approval for maintainer PR
All automated quality gates passed. See SECURITY_CONTROLS.md for compensating controls.
There was a problem hiding this comment.
Code Review
This pull request addresses issue #638 by centralizing and enforcing context timeouts across all job types in the core package. It introduces a MaxRuntimeProvider interface and a boundJobContext helper in the scheduler to ensure that even jobs without explicit MaxRuntime configurations (like ExecJob or LocalJob) are bounded by a default 24-hour timeout. Additionally, it refactors Context creation to use context.TODO() for better visibility of missing parent contexts and adds a 10-second timeout for environment resolution in local jobs. Comprehensive tests were added to verify these bounding behaviors. I have no further feedback to provide.
✅ Mutation Testing ResultsMutation Score: 66.67% (threshold: 60%)
What is mutation testing?Mutation testing measures test quality by introducing small changes (mutations) to the code and checking if tests detect them. A higher score means better test effectiveness.
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #651 +/- ##
==========================================
- Coverage 87.86% 87.82% -0.05%
==========================================
Files 88 88
Lines 10898 10908 +10
==========================================
+ Hits 9576 9580 +4
- Misses 1081 1085 +4
- Partials 241 243 +2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR attempts to bound remaining core scheduler/job contexts so Docker-backed job execution is less likely to hang indefinitely when Docker is wedged.
Changes:
- Adds scheduler-level per-run context deadlines via
boundJobContextandMaxRuntimeProvider. - Centralizes job context fallback in
(*Context).RunContext(). - Adds tests and changelog coverage for bounded context behavior.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
core/scheduler.go |
Adds default/max-runtime context wrapping around job execution. |
core/common.go |
Adds RunContext() and non-nil context construction fallback. |
core/execjob.go |
Switches exec job Docker calls to use RunContext(). |
core/runjob.go |
Exposes run-job max runtime and uses RunContext(). |
core/runservice.go |
Exposes service-job max runtime and uses RunContext(). |
core/localjob.go |
Adds a timeout around local job environment resolution. |
core/context_bounded_test.go |
Adds regression tests for context fallback and scheduler deadline wrapping. |
CHANGELOG.md |
Documents the context-bounding fix. |
Comments suppressed due to low confidence (3)
core/scheduler.go:825
- This comment is inaccurate for ExecJob, LocalJob, and ComposeJob: cli/config.go only copies
[global] max-runtimeinto RunJob and RunServiceJob, and those other job structs do not expose a per-job MaxRuntime field. As written, maintainers may assume a configurable global/per-job ceiling exists for these job types when the code only applies the hard-coded 24h fallback.
// MaxRuntimeProvider is implemented by job types that expose a per-job
// maximum runtime. Currently RunJob and RunServiceJob satisfy this
// interface; ExecJob, LocalJob, and ComposeJob do not (they inherit
// the bound from defaultJobMaxRuntime or, when wired through cli/config,
// from `[global] max-runtime`).
core/scheduler.go:920
- The new deadline is created once before the retry loop, but
RetryExecutor.ExecuteWithRetrysleeps withtime.Sleepand never checksjctx.Ctx.Done(). If the deadline expires during an attempt or during backoff, the wrapper can keep retrying/sleeping with an already-canceled context, so the configured MaxRuntime is not a hard upper bound for jobs with retries enabled.
boundedCtx, cancelBound := boundJobContext(ctx, w.j, defaultJobMaxRuntime)
defer cancelBound()
jctx := NewContextWithContext(boundedCtx, w.s, w.j, e)
core/scheduler.go:920
- Using MaxRuntime as the parent context for the entire invocation also cancels the context later reused for post-run work. When a run/service job reaches the deadline or finishes close to it, log collection and
deleteContainer(runCtx)/deleteService(runCtx)call Docker with an already-expired context, so cleanup can fail and leave resources behind; use a separate bounded cleanup context or apply MaxRuntime only to the blocking execution phase.
boundedCtx, cancelBound := boundJobContext(ctx, w.j, defaultJobMaxRuntime)
defer cancelBound()
jctx := NewContextWithContext(boundedCtx, w.s, w.j, e)



Summary
Sibling-hunt during PR #636 (which fixed #614 by bounding the cli/web Docker pings) flagged seven more unbounded
context.Background()/context.TODO()sites incore/that could let job execution hang on a wedged Docker upstream. This PR addresses all seven in a single coordinated change (~150 LOC of production code, ~210 LOC of new tests; well under the 400-LOC budget the issue allowed).Sites fixed:
core/common.go:NewContextcontext.Background()core/scheduler.go:maxConcurrentSkipJob.Runcron.Jobdefensive fallbackcore/scheduler.go:jobWrapper.Runcron.Jobdefensive fallbackcore/execjob.go:47ctx.Ctxfallbackcore/runjob.go:107ctx.Ctxfallbackcore/runservice.go:72ctx.Ctxfallbackcore/localjob.go:54ResolveJobEnvironment(env-from-container) unboundedApproach
Three coordinated fixes:
Centralize the nil-fallback in
(*Context).RunContext()so the four inlineif runCtx == nil { runCtx = context.Background() }blocks become a single helper call.NewContextnow delegates toNewContextWithContext(context.TODO(), ...)to flag the unset-parent case to readers and static analyzers.Apply scheduler-level MaxRuntime enforcement in
jobWrapper.runWithCtxvia the newboundJobContexthelper +MaxRuntimeProviderinterface.RunJobandRunServiceJob(which carryMaxRuntimefields) implement the interface and feed their per-job deadline into the bound; all other job types (ExecJob,LocalJob,ComposeJob) inherit thedefaultJobMaxRuntime = 24hceiling. PreviouslyRunJobwas the only job type whose internalstartAndWaitre-wrapped withWithTimeout;ExecJobandRunServiceJobcould hang against a slowdocker exec/service createwith no upper bound.Bound
LocalJob'sResolveJobEnvironmentcall (env-from-container) with a 10slocalJobEnvResolveTimeoutso a wedged Docker daemon cannot stall local-job startup.Also replaces the two defensive
context.Background()fallbacks in thecron.Job(non-context) paths withcontext.TODO()to make their intent explicit.Caveat (documented in CHANGELOG)
[global] max-runtimetoday only inherits intoRunJob/ServiceJobat config-load time (cli/config.go:449,471). Operators wanting a sub-day ceiling onExecJob/LocalJob/ComposeJobstill need a per-jobmax-runtime; widening the inheritance is intentionally left for a follow-up to keep this PR focused.Test plan
TestContext_RunContext_NilFallback/_PassThrough— helper contractTestNewContext_NeverNilCtx— factory invariantTestExecJob_RespectsBoundedParent— executor surfaces context cancellationTestBoundJobContext_FromJobMaxRuntime/_GlobalDefault— helper deadline mathTestJobWrapper_BoundsExecJobByDefaultMaxRuntime— end-to-end regression for Bound remaining unbounded contexts in core/scheduler, core/jobs, and core/common (sibling to #614) #638: feeds the wrapper an UNBOUNDEDcontext.TODO()and asserts the inner Docker mock receives a context with a Deadline, proving the wrapper-level bound is applied even for job types that do not implementMaxRuntimeProviderthemselves.core/context_propagation_test.go(3 tests for ctx-propagation through Exec/Run/Service jobs) still passes.go test ./... -count=1 -short -race— all packages greengolangci-lint run ./...— 0 issuesRefs