Prevent long-running PR maintenance from starving fresh repo scans #382

@nicobistolfi

Summary

Prevent long-running PR maintenance and conflict-resolution work from blocking the daemon scan loop, so new issues on watched repositories continue to be discovered and dispatched while existing maintenance jobs are in progress.

Problem

  • Vigilante currently runs PR maintenance inline in the main scan loop before it starts per-repository issue scanning.
  • When maintenance escalates into long-running work such as conflict resolution or coding-agent execution, the daemon does not return to fresh repository scans until that work completes.
  • This causes newly opened issues to sit undiscovered even though the daemon process is still running and the watch target remains healthy.

Context

  • In internal/app/app.go, the scan loop runs maintainPullRequests(...) before the watch-target scan phase.
  • maintainPullRequests(...) can enter maintainOpenPullRequest(...) and trigger long-running downstream work, including conflict-resolution flows that invoke coding agents.
  • Recent local evidence showed the daemon stuck with last_scan_at frozen while new teros-dev/platform issues remained unpicked:
    • watch target teros-dev/platform last scanned at 2026-04-02T02:31:53Z
    • issue #173 opened at 2026-04-02T02:37:42Z
    • issue #174 opened at 2026-04-02T02:39:30Z
    • issue #175 opened at 2026-04-02T02:41:56Z
  • Access logs show maintenance codex exec tasks blocking for several minutes at a time, for example:
    • issue #358: 287237ms (about 4.8 minutes)
    • issue #359: 342912ms (about 5.7 minutes) and later 211035ms (about 3.5 minutes)
    • issue #360: 350132ms (about 5.8 minutes)
  • The result is head-of-line blocking: one repository's PR maintenance prevents unrelated watched repositories from being rescanned promptly.

Desired Outcome

  • New issue discovery and dispatch scans continue on schedule even while PR maintenance or conflict-resolution work is running for other sessions.
  • Long-running maintenance work no longer starves unrelated watch targets of scan time.
  • Existing PR maintenance, conflict resolution, and session state transitions continue to work correctly, but they no longer monopolize the daemon loop.
  • The fix stays focused on scan-loop scheduling and execution isolation rather than redesigning the entire daemon architecture.

Implementation Notes

  • The critical requirement is execution isolation: long-running maintenance operations must not block the next scan pass for watched repositories.
  • Acceptable implementation approaches include:
    • moving PR maintenance onto a separate worker queue or goroutine pool
    • limiting maintenance work per scan tick and resuming incrementally on later ticks
    • splitting dispatch/scan orchestration from maintenance orchestration so fresh scans continue independently
    • another scheduling design with equivalent non-blocking behavior
  • Preserve existing correctness guarantees for session updates, label sync, conflict-resolution handoff, logging, and saved state.
  • If concurrency is introduced, protect shared state updates and persistence so sessions are not lost, duplicated, or corrupted.
  • Keep the change compatible with existing max-parallel issue-dispatch semantics for watched repositories.
  • The fix should also preserve rate-limit handling; reducing scan starvation must not cause uncontrolled concurrent GitHub API spikes.
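
As one illustration of the first approach above, the following is a minimal sketch of a bounded maintenance worker pool with a non-blocking submit. It is not taken from internal/app/app.go: the names maintenanceJob, maintenancePool, TrySubmit, and jobKey are hypothetical, and the sleep only stands in for a long-running conflict-resolution or coding-agent run. The fixed worker count is what keeps concurrent maintenance (and the GitHub API traffic it generates) capped.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// maintenanceJob is a hypothetical unit of PR maintenance work.
type maintenanceJob struct {
	Repo string
	PR   int
	Run  func() // potentially long-running work
}

// maintenancePool runs jobs on a fixed number of workers so the caller
// (the scan loop) never waits for a job to finish.
type maintenancePool struct {
	jobs     chan maintenanceJob
	wg       sync.WaitGroup
	mu       sync.Mutex
	inFlight map[string]bool // keyed by "repo#pr" to avoid duplicate work
}

func jobKey(j maintenanceJob) string { return fmt.Sprintf("%s#%d", j.Repo, j.PR) }

func newMaintenancePool(workers, queueSize int) *maintenancePool {
	p := &maintenancePool{
		jobs:     make(chan maintenanceJob, queueSize),
		inFlight: make(map[string]bool),
	}
	for i := 0; i < workers; i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for job := range p.jobs {
				job.Run()
				// Mark the job finished so it can be submitted again later.
				p.mu.Lock()
				delete(p.inFlight, jobKey(job))
				p.mu.Unlock()
			}
		}()
	}
	return p
}

// TrySubmit enqueues a job without blocking. It returns false when the same
// job is already queued or running, or when the queue is full; the caller
// can simply retry on a later scan tick.
func (p *maintenancePool) TrySubmit(j maintenanceJob) bool {
	key := jobKey(j)
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.inFlight[key] {
		return false
	}
	select {
	case p.jobs <- j:
		p.inFlight[key] = true
		return true
	default:
		return false
	}
}

// Close stops accepting jobs and waits for queued and running jobs to finish.
// TrySubmit must not be called after Close.
func (p *maintenancePool) Close() {
	close(p.jobs)
	p.wg.Wait()
}

func main() {
	pool := newMaintenancePool(2, 8)
	pool.TrySubmit(maintenanceJob{Repo: "org/repo-a", PR: 1, Run: func() {
		time.Sleep(200 * time.Millisecond) // stands in for a multi-minute agent run
	}})
	// The scan loop keeps ticking while the job above is still running.
	for tick := 1; tick <= 3; tick++ {
		fmt.Printf("scan pass %d completed\n", tick)
		time.Sleep(20 * time.Millisecond)
	}
	pool.Close()
}
```

In this sketch a rejected submit is not an error: a full queue or an already-running job just means the scan loop moves on and offers the work again on the next tick, which keeps scheduling predictable.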

Acceptance Criteria

  • Long-running PR maintenance or conflict-resolution work does not block fresh watched-repository scans for unrelated repositories.
  • A newly opened eligible issue on a watched repository can be discovered and dispatched while another session's maintenance work is still running.
  • last_scan_at continues to advance on watched targets during extended maintenance activity.
  • Session state, label synchronization, and maintenance outcomes remain correct after the scheduling change.
  • The implementation does not introduce duplicate dispatches, lost session updates, or unsafe concurrent state writes.

Testing Expectations

  • Add or update tests that simulate long-running maintenance work and verify that scan passes still proceed for watched repositories.
  • Cover a regression case where a new eligible issue opens while maintenance is busy, and verify it is discovered without waiting for maintenance completion.
  • Cover concurrency and persistence safety for session updates if maintenance and scanning can now overlap.
  • Cover rate-limit or scheduling regressions so the fix does not trade starvation for excessive concurrent GitHub API traffic.

Operational / UX Considerations

  • Keep logs and status output clear enough to distinguish ongoing maintenance work from fresh scan activity.
  • Prefer a design that preserves predictable scheduling so operators can understand why an issue has or has not been picked up.
  • Avoid requiring manual daemon restarts to resume normal scans after long maintenance work begins.

Metadata

Labels

vigilante:done - Vigilante completed its work on the issue and no further automation is expected.
