Summary
Prevent long-running PR maintenance and conflict-resolution work from blocking the daemon scan loop, so new issues on watched repositories continue to be discovered and dispatched while existing maintenance jobs are in progress.
Problem
- Vigilante currently runs PR maintenance inline in the main scan loop before it starts per-repository issue scanning.
- When maintenance escalates into long-running work such as conflict resolution or coding-agent execution, the daemon does not return to fresh repository scans until that work completes.
- This causes newly opened issues to sit undiscovered even though the daemon process is still running and the watch target remains healthy.
Context
- In internal/app/app.go, the scan loop runs maintainPullRequests(...) before the watch-target scan phase.
- maintainPullRequests(...) can enter maintainOpenPullRequest(...) and trigger long-running downstream work, including conflict-resolution flows that invoke coding agents.
- Recent local evidence showed the daemon stuck with last_scan_at frozen while new teros-dev/platform issues remained unpicked:
  - watch target teros-dev/platform last scanned at 2026-04-02T02:31:53Z
  - issue #173 opened at 2026-04-02T02:37:42Z
  - issue #174 opened at 2026-04-02T02:39:30Z
  - issue #175 opened at 2026-04-02T02:41:56Z
- Access logs show maintenance codex exec tasks blocking for multiple minutes at a time, for example:
  - issue #358: 287237ms
  - issue #359: 342912ms and later 211035ms
  - issue #360: 350132ms
- The result is head-of-line blocking: one repository's PR maintenance prevents unrelated watched repositories from being rescanned promptly.
Desired Outcome
- New issue discovery and dispatch scans continue on schedule even while PR maintenance or conflict-resolution work is running for other sessions.
- Long-running maintenance work no longer starves unrelated watch targets of scan time.
- Existing PR maintenance, conflict resolution, and session state transitions continue to work correctly, but they no longer monopolize the daemon loop.
- The fix stays focused on scan-loop scheduling and execution isolation rather than redesigning the entire daemon architecture.
Implementation Notes
- The critical requirement is execution isolation: long-running maintenance operations must not block the next scan pass for watched repositories.
- Acceptable implementation approaches include:
  - moving PR maintenance onto a separate worker queue or goroutine pool
  - limiting maintenance work per scan tick and resuming incrementally on later ticks
  - splitting dispatch/scan orchestration from maintenance orchestration so fresh scans continue independently
  - another scheduling design with equivalent non-blocking behavior
- Preserve existing correctness guarantees for session updates, label sync, conflict-resolution handoff, logging, and saved state.
- If concurrency is introduced, protect shared state updates and persistence so sessions are not lost, duplicated, or corrupted.
- Keep the change compatible with existing max-parallel issue-dispatch semantics for watched repositories.
- The fix should also preserve rate-limit handling; reducing scan starvation must not cause uncontrolled concurrent GitHub API spikes.
Acceptance Criteria
- last_scan_at continues to advance on watched targets during extended maintenance activity.
Testing Expectations
- Add or update tests that simulate long-running maintenance work and verify that scan passes still proceed for watched repositories.
- Cover a regression case where a new eligible issue opens while maintenance is busy, and verify it is discovered without waiting for maintenance completion.
- Cover concurrency and persistence safety for session updates if maintenance and scanning can now overlap.
- Cover rate-limit or scheduling regressions so the fix does not trade starvation for excessive concurrent GitHub API traffic.
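The first two expectations could be exercised along these lines. This is a sketch against a hypothetical fakeDaemon stand-in, not the real daemon types. Maintenance is an injected function that blocks until the test releases it, and the scenario checks that scan passes and issue discovery keep advancing in the meantime. The issue numbers reuse #173 and #174 from the Context section purely as sample data.

```go
package main

import "sync"

// fakeDaemon is a hypothetical stand-in for the daemon under test.
type fakeDaemon struct {
	mu          sync.Mutex
	wg          sync.WaitGroup
	maintaining bool
	scans       int
	seen        map[int]bool
	discovered  []int
	maintain    func() // injected long-running maintenance
}

func newFakeDaemon(maintain func()) *fakeDaemon {
	return &fakeDaemon{seen: make(map[int]bool), maintain: maintain}
}

// tick runs one scan pass. Maintenance is started in the background
// (at most one run at a time) so the scan itself never waits on it.
func (d *fakeDaemon) tick(openIssues []int) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if !d.maintaining {
		d.maintaining = true
		d.wg.Add(1)
		go func() {
			defer d.wg.Done()
			d.maintain()
			d.mu.Lock()
			d.maintaining = false
			d.mu.Unlock()
		}()
	}
	d.scans++
	for _, n := range openIssues {
		if !d.seen[n] {
			d.seen[n] = true
			d.discovered = append(d.discovered, n)
		}
	}
}

// snapshot returns the scan count and a copy of the discovered issues.
func (d *fakeDaemon) snapshot() (int, []int) {
	d.mu.Lock()
	defer d.mu.Unlock()
	return d.scans, append([]int(nil), d.discovered...)
}

// scanWhileMaintenanceBusy runs three scan ticks while maintenance is
// still blocked and reports what the scans saw before it was released.
func scanWhileMaintenanceBusy() (int, []int) {
	release := make(chan struct{})
	d := newFakeDaemon(func() { <-release })
	d.tick(nil)
	d.tick([]int{173})      // a new issue opens while maintenance is busy
	d.tick([]int{173, 174}) // and another one on the next tick
	scans, found := d.snapshot()
	close(release) // only now let maintenance finish
	d.wg.Wait()
	return scans, found
}
```

In a real _test.go file the same scenario would be asserted with t.Fatalf, and a channel-based release like the one above avoids depending on sleeps or wall-clock timing.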
Operational / UX Considerations
- Keep logs and status output clear enough to distinguish ongoing maintenance work from fresh scan activity.
- Prefer a design that preserves predictable scheduling so operators can understand why an issue has or has not been picked up.
- Avoid requiring manual daemon restarts to resume normal scans after long maintenance work begins.