fix(core): bound remaining unbounded contexts in scheduler and jobs (#638) by CybotTM · Pull Request #651 · netresearch/ofelia

CybotTM · 2026-05-13T22:01:54Z

Summary

Sibling-hunt during PR #636 (which fixed #614 by bounding the cli/web Docker pings) flagged seven more unbounded context.Background() / context.TODO() sites in core/ that could let job execution hang on a wedged Docker upstream. This PR addresses all seven in a single coordinated change (~150 LOC of production code, ~210 LOC of new tests; well under the 400-LOC budget the issue allowed).

Sites fixed:

File	Symptom
`core/common.go:NewContext`	Foundational — every job inherited `context.Background()`
`core/scheduler.go:maxConcurrentSkipJob.Run`	`cron.Job` defensive fallback
`core/scheduler.go:jobWrapper.Run`	`cron.Job` defensive fallback
`core/execjob.go:47`	nil-`ctx.Ctx` fallback
`core/runjob.go:107`	nil-`ctx.Ctx` fallback
`core/runservice.go:72`	nil-`ctx.Ctx` fallback
`core/localjob.go:54`	`ResolveJobEnvironment` (env-from-container) unbounded

Approach

Three coordinated fixes:

Centralize the nil-fallback in (*Context).RunContext() so the four inline if runCtx == nil { runCtx = context.Background() } blocks become a single helper call. NewContext now delegates to NewContextWithContext(context.TODO(), ...) to flag the unset-parent case to readers and static analyzers.
Apply scheduler-level MaxRuntime enforcement in jobWrapper.runWithCtx via the new boundJobContext helper + MaxRuntimeProvider interface. RunJob and RunServiceJob (which carry MaxRuntime fields) implement the interface and feed their per-job deadline into the bound; all other job types (ExecJob, LocalJob, ComposeJob) inherit the defaultJobMaxRuntime = 24h ceiling. Previously RunJob was the only job type whose internal startAndWait re-wrapped with WithTimeout; ExecJob and RunServiceJob could hang against a slow docker exec / service create with no upper bound.
Bound LocalJob's ResolveJobEnvironment call (env-from-container) with a 10s localJobEnvResolveTimeout so a wedged Docker daemon cannot stall local-job startup.

Also replaces the two defensive context.Background() fallbacks in the cron.Job (non-context) paths with context.TODO() to make their intent explicit.

Caveat (documented in CHANGELOG)

[global] max-runtime today only inherits into RunJob/ServiceJob at config-load time (cli/config.go:449,471). Operators wanting a sub-day ceiling on ExecJob/LocalJob/ComposeJob still need a per-job max-runtime; widening the inheritance is intentionally left for a follow-up to keep this PR focused.

Test plan

New TestContext_RunContext_NilFallback / _PassThrough — helper contract
New TestNewContext_NeverNilCtx — factory invariant
New TestExecJob_RespectsBoundedParent — executor surfaces context cancellation
New TestBoundJobContext_FromJobMaxRuntime / _GlobalDefault — helper deadline math
New TestJobWrapper_BoundsExecJobByDefaultMaxRuntime — end-to-end regression for Bound remaining unbounded contexts in core/scheduler, core/jobs, and core/common (sibling to #614) #638: feeds the wrapper an UNBOUNDED context.TODO() and asserts the inner Docker mock receives a context with a Deadline, proving the wrapper-level bound is applied even for job types that do not implement MaxRuntimeProvider themselves.
Existing core/context_propagation_test.go (3 tests for ctx-propagation through Exec/Run/Service jobs) still passes.
go test ./... -count=1 -short -race — all packages green
golangci-lint run ./... — 0 issues

Refs

Fixes #638
Sibling to #636 (fixed #614)
Original Docker SDK ping bound: #611 (fixed #608)

Sibling-hunt during PR #636 (which fixed #614 by bounding the cli/web Docker pings) flagged seven more unbounded context.Background() / context.TODO() sites in core/ that could let job execution hang on a wedged Docker upstream: - core/common.go:NewContext (foundational; every job inherited) - core/scheduler.go:maxConcurrentSkipJob.Run (cron.Job fallback) - core/scheduler.go:jobWrapper.Run (cron.Job fallback) - core/execjob.go:47 (nil ctx.Ctx fallback) - core/runjob.go:107 (nil ctx.Ctx fallback) - core/runservice.go:72 (nil ctx.Ctx fallback) - core/localjob.go:54 (ResolveJobEnvironment unbounded) Three coordinated fixes: 1. Centralize the nil-fallback in (*Context).RunContext() so the four inline "if runCtx == nil { runCtx = context.Background() }" blocks become a single helper call. NewContext now delegates to NewContextWithContext(context.TODO(), ...) to flag the unset-parent case to readers and static analyzers. 2. Apply scheduler-level MaxRuntime enforcement in jobWrapper.runWithCtx via the new boundJobContext helper + MaxRuntimeProvider interface. RunJob and RunServiceJob (which carry MaxRuntime fields) implement the interface and feed their per-job deadline into the bound; all other job types (ExecJob, LocalJob, ComposeJob) inherit the defaultJobMaxRuntime = 24h ceiling. Previously RunJob was the only job type whose internal startAndWait re-wrapped with WithTimeout; ExecJob and RunServiceJob could hang against a slow docker exec/service create with no upper bound. 3. Bound LocalJob's ResolveJobEnvironment call (env-from-container) with a 10s localJobEnvResolveTimeout so a wedged Docker daemon cannot stall local-job startup. Also replaces the two defensive context.Background() fallbacks in the cron.Job (non-context) paths with context.TODO() to make their intent explicit. Tests: - TestContext_RunContext_NilFallback / _PassThrough: helper contract - TestNewContext_NeverNilCtx: factory invariant - TestExecJob_RespectsBoundedParent: executor surfaces ctx cancel - TestBoundJobContext_FromJobMaxRuntime / _GlobalDefault: helper math - TestJobWrapper_BoundsExecJobByDefaultMaxRuntime: end-to-end regression — feeds the wrapper an UNBOUNDED context.TODO() and asserts the inner Docker mock receives a context with a Deadline, proving the wrapper-level bound is applied even for job types that do not implement MaxRuntimeProvider themselves. Caveat documented in CHANGELOG: [global] max-runtime today only inherits into RunJob/ServiceJob at config-load time (cli/config.go:449,471). Operators wanting a sub-day ceiling on ExecJob/LocalJob/ComposeJob still need a per-job max-runtime; widening the inheritance is left for a follow-up. Refs: #614, #636 Fixes: #638 Signed-off-by: Sebastian Mendel <[email protected]>

sonarqubecloud · 2026-05-13T22:02:44Z

Quality Gate passed

Issues
7 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

github-actions · 2026-05-13T22:03:26Z

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

github-actions

Automated approval for maintainer PR

All automated quality gates passed. See SECURITY_CONTROLS.md for compensating controls.

gemini-code-assist

Code Review

This pull request addresses issue #638 by centralizing and enforcing context timeouts across all job types in the core package. It introduces a MaxRuntimeProvider interface and a boundJobContext helper in the scheduler to ensure that even jobs without explicit MaxRuntime configurations (like ExecJob or LocalJob) are bounded by a default 24-hour timeout. Additionally, it refactors Context creation to use context.TODO() for better visibility of missing parent contexts and adds a 10-second timeout for environment resolution in local jobs. Comprehensive tests were added to verify these bounding behaviors. I have no further feedback to provide.

github-actions · 2026-05-13T22:05:07Z

✅ Mutation Testing Results

Mutation Score: 66.67% (threshold: 60%)

✨ Good job! Mutation score meets the threshold.

What is mutation testing?

Mutation testing measures test quality by introducing small changes (mutations) to the code and checking if tests detect them. A higher score means better test effectiveness.

Killed mutants: Tests caught the mutation (good!)
Survived mutants: Tests missed the mutation (needs improvement)

codecov · 2026-05-13T22:05:35Z

Codecov Report

❌ Patch coverage is 73.33333% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.82%. Comparing base (c60cd65) to head (d3c8935).

Files with missing lines	Patch %	Lines
core/scheduler.go	76.92%	2 Missing and 1 partial ⚠️
core/common.go	71.42%	1 Missing and 1 partial ⚠️
core/runservice.go	33.33%	2 Missing ⚠️
core/localjob.go	66.66%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #651      +/-   ##
==========================================
- Coverage   87.86%   87.82%   -0.05%     
==========================================
  Files          88       88              
  Lines       10898    10908      +10     
==========================================
+ Hits         9576     9580       +4     
- Misses       1081     1085       +4     
- Partials      241      243       +2

Flag	Coverage Δ
integration	`87.80% <73.33%> (-0.07%)`	⬇️
unittests	`84.87% <73.33%> (-0.03%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copilot

Pull request overview

This PR attempts to bound remaining core scheduler/job contexts so Docker-backed job execution is less likely to hang indefinitely when Docker is wedged.

Changes:

Adds scheduler-level per-run context deadlines via boundJobContext and MaxRuntimeProvider.
Centralizes job context fallback in (*Context).RunContext().
Adds tests and changelog coverage for bounded context behavior.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
`core/scheduler.go`	Adds default/max-runtime context wrapping around job execution.
`core/common.go`	Adds `RunContext()` and non-nil context construction fallback.
`core/execjob.go`	Switches exec job Docker calls to use `RunContext()`.
`core/runjob.go`	Exposes run-job max runtime and uses `RunContext()`.
`core/runservice.go`	Exposes service-job max runtime and uses `RunContext()`.
`core/localjob.go`	Adds a timeout around local job environment resolution.
`core/context_bounded_test.go`	Adds regression tests for context fallback and scheduler deadline wrapping.
`CHANGELOG.md`	Documents the context-bounding fix.

Comments suppressed due to low confidence (3)

core/scheduler.go:825

This comment is inaccurate for ExecJob, LocalJob, and ComposeJob: cli/config.go only copies [global] max-runtime into RunJob and RunServiceJob, and those other job structs do not expose a per-job MaxRuntime field. As written, maintainers may assume a configurable global/per-job ceiling exists for these job types when the code only applies the hard-coded 24h fallback.

// MaxRuntimeProvider is implemented by job types that expose a per-job
// maximum runtime. Currently RunJob and RunServiceJob satisfy this
// interface; ExecJob, LocalJob, and ComposeJob do not (they inherit
// the bound from defaultJobMaxRuntime or, when wired through cli/config,
// from `[global] max-runtime`).

core/scheduler.go:920

The new deadline is created once before the retry loop, but RetryExecutor.ExecuteWithRetry sleeps with time.Sleep and never checks jctx.Ctx.Done(). If the deadline expires during an attempt or during backoff, the wrapper can keep retrying/sleeping with an already-canceled context, so the configured MaxRuntime is not a hard upper bound for jobs with retries enabled.

	boundedCtx, cancelBound := boundJobContext(ctx, w.j, defaultJobMaxRuntime)
	defer cancelBound()

	jctx := NewContextWithContext(boundedCtx, w.s, w.j, e)

core/scheduler.go:920

Using MaxRuntime as the parent context for the entire invocation also cancels the context later reused for post-run work. When a run/service job reaches the deadline or finishes close to it, log collection and deleteContainer(runCtx)/deleteService(runCtx) call Docker with an already-expired context, so cleanup can fail and leave resources behind; use a separate bounded cleanup context or apply MaxRuntime only to the blocking execution phase.

	boundedCtx, cancelBound := boundJobContext(ctx, w.j, defaultJobMaxRuntime)
	defer cancelBound()

	jctx := NewContextWithContext(boundedCtx, w.s, w.j, e)

Copilot AI review requested due to automatic review settings May 13, 2026 22:01

github-actions Bot added documentation Improvements or additions to documentation tests labels May 13, 2026

Copilot started reviewing on behalf of CybotTM May 13, 2026 22:03 View session

github-actions Bot approved these changes May 13, 2026

View reviewed changes

gemini-code-assist Bot reviewed May 13, 2026

View reviewed changes

Copilot AI reviewed May 13, 2026

View reviewed changes

Comment thread core/scheduler.go

Comment thread core/runservice.go

Comment thread CHANGELOG.md

Comment thread core/localjob.go

Comment thread core/scheduler.go

Comment thread core/context_bounded_test.go

CybotTM mentioned this pull request May 13, 2026

core: scheduler MaxRuntime cancellation must actually stop the running container/exec/service #655

Closed

CybotTM added this pull request to the merge queue May 14, 2026

Merged via the queue into main with commit 798c41f May 14, 2026
30 of 31 checks passed

CybotTM deleted the fix/issue-638 branch May 14, 2026 05:38

This was referenced May 14, 2026

fix(core): stop & remove orphaned containers/services on MaxRuntime cancellation #659

Merged

chore: prepare v0.25.0 release #663

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(core): bound remaining unbounded contexts in scheduler and jobs (#638)#651

fix(core): bound remaining unbounded contexts in scheduler and jobs (#638)#651
CybotTM merged 1 commit into
mainfrom
fix/issue-638

CybotTM commented May 13, 2026

Uh oh!

sonarqubecloud Bot commented May 13, 2026

Uh oh!

github-actions Bot commented May 13, 2026

Uh oh!

github-actions Bot left a comment

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

github-actions Bot commented May 13, 2026

Uh oh!

codecov Bot commented May 13, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

CybotTM commented May 13, 2026

Summary

Approach

Caveat (documented in CHANGELOG)

Test plan

Refs

Uh oh!

sonarqubecloud Bot commented May 13, 2026

Quality Gate passed

Uh oh!

github-actions Bot commented May 13, 2026

Dependency Review

Scanned Files

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

github-actions Bot commented May 13, 2026

✅ Mutation Testing Results

Uh oh!

codecov Bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov Bot commented May 13, 2026 •

edited

Loading