Skip to content

fix(core): bound remaining unbounded contexts in scheduler and jobs (#638)#651

Merged
CybotTM merged 1 commit into
mainfrom
fix/issue-638
May 14, 2026
Merged

fix(core): bound remaining unbounded contexts in scheduler and jobs (#638)#651
CybotTM merged 1 commit into
mainfrom
fix/issue-638

Conversation

@CybotTM

@CybotTM CybotTM commented May 13, 2026

Copy link
Copy Markdown
Member

Summary

Sibling-hunt during PR #636 (which fixed #614 by bounding the cli/web Docker pings) flagged seven more unbounded context.Background() / context.TODO() sites in core/ that could let job execution hang on a wedged Docker upstream. This PR addresses all seven in a single coordinated change (~150 LOC of production code, ~210 LOC of new tests; well under the 400-LOC budget the issue allowed).

Sites fixed:

File Symptom
core/common.go:NewContext Foundational — every job inherited context.Background()
core/scheduler.go:maxConcurrentSkipJob.Run cron.Job defensive fallback
core/scheduler.go:jobWrapper.Run cron.Job defensive fallback
core/execjob.go:47 nil-ctx.Ctx fallback
core/runjob.go:107 nil-ctx.Ctx fallback
core/runservice.go:72 nil-ctx.Ctx fallback
core/localjob.go:54 ResolveJobEnvironment (env-from-container) unbounded

Approach

Three coordinated fixes:

  1. Centralize the nil-fallback in (*Context).RunContext() so the four inline if runCtx == nil { runCtx = context.Background() } blocks become a single helper call. NewContext now delegates to NewContextWithContext(context.TODO(), ...) to flag the unset-parent case to readers and static analyzers.

  2. Apply scheduler-level MaxRuntime enforcement in jobWrapper.runWithCtx via the new boundJobContext helper + MaxRuntimeProvider interface. RunJob and RunServiceJob (which carry MaxRuntime fields) implement the interface and feed their per-job deadline into the bound; all other job types (ExecJob, LocalJob, ComposeJob) inherit the defaultJobMaxRuntime = 24h ceiling. Previously RunJob was the only job type whose internal startAndWait re-wrapped with WithTimeout; ExecJob and RunServiceJob could hang against a slow docker exec / service create with no upper bound.

  3. Bound LocalJob's ResolveJobEnvironment call (env-from-container) with a 10s localJobEnvResolveTimeout so a wedged Docker daemon cannot stall local-job startup.

Also replaces the two defensive context.Background() fallbacks in the cron.Job (non-context) paths with context.TODO() to make their intent explicit.

Caveat (documented in CHANGELOG)

[global] max-runtime today only inherits into RunJob/ServiceJob at config-load time (cli/config.go:449,471). Operators wanting a sub-day ceiling on ExecJob/LocalJob/ComposeJob still need a per-job max-runtime; widening the inheritance is intentionally left for a follow-up to keep this PR focused.

Test plan

  • New TestContext_RunContext_NilFallback / _PassThrough — helper contract
  • New TestNewContext_NeverNilCtx — factory invariant
  • New TestExecJob_RespectsBoundedParent — executor surfaces context cancellation
  • New TestBoundJobContext_FromJobMaxRuntime / _GlobalDefault — helper deadline math
  • New TestJobWrapper_BoundsExecJobByDefaultMaxRuntimeend-to-end regression for Bound remaining unbounded contexts in core/scheduler, core/jobs, and core/common (sibling to #614) #638: feeds the wrapper an UNBOUNDED context.TODO() and asserts the inner Docker mock receives a context with a Deadline, proving the wrapper-level bound is applied even for job types that do not implement MaxRuntimeProvider themselves.
  • Existing core/context_propagation_test.go (3 tests for ctx-propagation through Exec/Run/Service jobs) still passes.
  • go test ./... -count=1 -short -race — all packages green
  • golangci-lint run ./... — 0 issues

Refs

Sibling-hunt during PR #636 (which fixed #614 by bounding the cli/web
Docker pings) flagged seven more unbounded context.Background() /
context.TODO() sites in core/ that could let job execution hang on a
wedged Docker upstream:

  - core/common.go:NewContext (foundational; every job inherited)
  - core/scheduler.go:maxConcurrentSkipJob.Run (cron.Job fallback)
  - core/scheduler.go:jobWrapper.Run (cron.Job fallback)
  - core/execjob.go:47 (nil ctx.Ctx fallback)
  - core/runjob.go:107 (nil ctx.Ctx fallback)
  - core/runservice.go:72 (nil ctx.Ctx fallback)
  - core/localjob.go:54 (ResolveJobEnvironment unbounded)

Three coordinated fixes:

1. Centralize the nil-fallback in (*Context).RunContext() so the four
   inline "if runCtx == nil { runCtx = context.Background() }" blocks
   become a single helper call. NewContext now delegates to
   NewContextWithContext(context.TODO(), ...) to flag the unset-parent
   case to readers and static analyzers.

2. Apply scheduler-level MaxRuntime enforcement in jobWrapper.runWithCtx
   via the new boundJobContext helper + MaxRuntimeProvider interface.
   RunJob and RunServiceJob (which carry MaxRuntime fields) implement
   the interface and feed their per-job deadline into the bound; all
   other job types (ExecJob, LocalJob, ComposeJob) inherit the
   defaultJobMaxRuntime = 24h ceiling. Previously RunJob was the only
   job type whose internal startAndWait re-wrapped with WithTimeout;
   ExecJob and RunServiceJob could hang against a slow docker
   exec/service create with no upper bound.

3. Bound LocalJob's ResolveJobEnvironment call (env-from-container)
   with a 10s localJobEnvResolveTimeout so a wedged Docker daemon
   cannot stall local-job startup.

Also replaces the two defensive context.Background() fallbacks in the
cron.Job (non-context) paths with context.TODO() to make their intent
explicit.

Tests:

  - TestContext_RunContext_NilFallback / _PassThrough: helper contract
  - TestNewContext_NeverNilCtx: factory invariant
  - TestExecJob_RespectsBoundedParent: executor surfaces ctx cancel
  - TestBoundJobContext_FromJobMaxRuntime / _GlobalDefault: helper math
  - TestJobWrapper_BoundsExecJobByDefaultMaxRuntime: end-to-end
    regression — feeds the wrapper an UNBOUNDED context.TODO() and
    asserts the inner Docker mock receives a context with a Deadline,
    proving the wrapper-level bound is applied even for job types
    that do not implement MaxRuntimeProvider themselves.

Caveat documented in CHANGELOG: [global] max-runtime today only
inherits into RunJob/ServiceJob at config-load time
(cli/config.go:449,471). Operators wanting a sub-day ceiling on
ExecJob/LocalJob/ComposeJob still need a per-job max-runtime; widening
the inheritance is left for a follow-up.

Refs: #614, #636
Fixes: #638

Signed-off-by: Sebastian Mendel <[email protected]>
Copilot AI review requested due to automatic review settings May 13, 2026 22:01
@sonarqubecloud

Copy link
Copy Markdown

@github-actions github-actions Bot added documentation Improvements or additions to documentation tests labels May 13, 2026
@github-actions

Copy link
Copy Markdown

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated approval for maintainer PR

All automated quality gates passed. See SECURITY_CONTROLS.md for compensating controls.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses issue #638 by centralizing and enforcing context timeouts across all job types in the core package. It introduces a MaxRuntimeProvider interface and a boundJobContext helper in the scheduler to ensure that even jobs without explicit MaxRuntime configurations (like ExecJob or LocalJob) are bounded by a default 24-hour timeout. Additionally, it refactors Context creation to use context.TODO() for better visibility of missing parent contexts and adds a 10-second timeout for environment resolution in local jobs. Comprehensive tests were added to verify these bounding behaviors. I have no further feedback to provide.

@github-actions

Copy link
Copy Markdown

✅ Mutation Testing Results

Mutation Score: 66.67% (threshold: 60%)

✨ Good job! Mutation score meets the threshold.

What is mutation testing?

Mutation testing measures test quality by introducing small changes (mutations) to the code and checking if tests detect them. A higher score means better test effectiveness.

  • Killed mutants: Tests caught the mutation (good!)
  • Survived mutants: Tests missed the mutation (needs improvement)

@codecov

codecov Bot commented May 13, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 73.33333% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.82%. Comparing base (c60cd65) to head (d3c8935).

Files with missing lines Patch % Lines
core/scheduler.go 76.92% 2 Missing and 1 partial ⚠️
core/common.go 71.42% 1 Missing and 1 partial ⚠️
core/runservice.go 33.33% 2 Missing ⚠️
core/localjob.go 66.66% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #651      +/-   ##
==========================================
- Coverage   87.86%   87.82%   -0.05%     
==========================================
  Files          88       88              
  Lines       10898    10908      +10     
==========================================
+ Hits         9576     9580       +4     
- Misses       1081     1085       +4     
- Partials      241      243       +2     
Flag Coverage Δ
integration 87.80% <73.33%> (-0.07%) ⬇️
unittests 84.87% <73.33%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR attempts to bound remaining core scheduler/job contexts so Docker-backed job execution is less likely to hang indefinitely when Docker is wedged.

Changes:

  • Adds scheduler-level per-run context deadlines via boundJobContext and MaxRuntimeProvider.
  • Centralizes job context fallback in (*Context).RunContext().
  • Adds tests and changelog coverage for bounded context behavior.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
core/scheduler.go Adds default/max-runtime context wrapping around job execution.
core/common.go Adds RunContext() and non-nil context construction fallback.
core/execjob.go Switches exec job Docker calls to use RunContext().
core/runjob.go Exposes run-job max runtime and uses RunContext().
core/runservice.go Exposes service-job max runtime and uses RunContext().
core/localjob.go Adds a timeout around local job environment resolution.
core/context_bounded_test.go Adds regression tests for context fallback and scheduler deadline wrapping.
CHANGELOG.md Documents the context-bounding fix.
Comments suppressed due to low confidence (3)

core/scheduler.go:825

  • This comment is inaccurate for ExecJob, LocalJob, and ComposeJob: cli/config.go only copies [global] max-runtime into RunJob and RunServiceJob, and those other job structs do not expose a per-job MaxRuntime field. As written, maintainers may assume a configurable global/per-job ceiling exists for these job types when the code only applies the hard-coded 24h fallback.
// MaxRuntimeProvider is implemented by job types that expose a per-job
// maximum runtime. Currently RunJob and RunServiceJob satisfy this
// interface; ExecJob, LocalJob, and ComposeJob do not (they inherit
// the bound from defaultJobMaxRuntime or, when wired through cli/config,
// from `[global] max-runtime`).

core/scheduler.go:920

  • The new deadline is created once before the retry loop, but RetryExecutor.ExecuteWithRetry sleeps with time.Sleep and never checks jctx.Ctx.Done(). If the deadline expires during an attempt or during backoff, the wrapper can keep retrying/sleeping with an already-canceled context, so the configured MaxRuntime is not a hard upper bound for jobs with retries enabled.
	boundedCtx, cancelBound := boundJobContext(ctx, w.j, defaultJobMaxRuntime)
	defer cancelBound()

	jctx := NewContextWithContext(boundedCtx, w.s, w.j, e)

core/scheduler.go:920

  • Using MaxRuntime as the parent context for the entire invocation also cancels the context later reused for post-run work. When a run/service job reaches the deadline or finishes close to it, log collection and deleteContainer(runCtx)/deleteService(runCtx) call Docker with an already-expired context, so cleanup can fail and leave resources behind; use a separate bounded cleanup context or apply MaxRuntime only to the blocking execution phase.
	boundedCtx, cancelBound := boundJobContext(ctx, w.j, defaultJobMaxRuntime)
	defer cancelBound()

	jctx := NewContextWithContext(boundedCtx, w.s, w.j, e)

Comment thread core/scheduler.go
Comment thread core/runservice.go
Comment thread CHANGELOG.md
Comment thread core/localjob.go
Comment thread core/scheduler.go
Comment thread core/context_bounded_test.go
@CybotTM CybotTM added this pull request to the merge queue May 14, 2026
Merged via the queue into main with commit 798c41f May 14, 2026
30 of 31 checks passed
@CybotTM CybotTM deleted the fix/issue-638 branch May 14, 2026 05:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation tests

Projects

None yet

2 participants