Recover stale sessions by reconciling local state with GitHub issue and PR status #322

@nicobistolfi

Summary

Vigilante can leave a session marked running in local state even after the implementation actually completed and the corresponding GitHub issue/PR moved on. Add stale-session recovery that re-checks GitHub issue and pull request status so the daemon can reconcile bad local state instead of continuing to report false stale-running sessions.

Problem

  • Vigilante can persist a session as running in ~/.vigilante/sessions.json even though the implementation finished successfully.
  • vigilante status then reports a stale session based on old last_heartbeat_at / updated_at data, even when GitHub already shows the issue and PR in a terminal or clearly advanced state.
  • This creates misleading operational output and can block or confuse later automation because local state is treated as the source of truth when it is actually stale.
  • This matters now because the failure mode is not self-healing: an operator has to manually run vigilante cleanup to clear state that Vigilante should be able to reconcile safely on its own.

Context

  • Observed bad scenario for aliengiraffe/vigilante issue #299:
    • vigilante logs --repo aliengiraffe/vigilante --issue 299 shows the session succeeded at 2026-03-26 10:39:35 AM PDT and opened PR #319.
    • ~/.vigilante/sessions.json still showed issue #299 as status: "running" with updated_at and last_heartbeat_at stuck at 2026-03-26T17:11:08Z.
    • vigilante status reported Stale sessions (1) and listed Issue #299 in aliengiraffe/vigilante: running.
  • GitHub state for the same issue had already moved on:
    • issue #299 was closed
    • the issue carried vigilante:done
    • the implementation log had already recorded successful completion and PR creation
  • The likely trigger was a later scheduler wedge on another issue, which prevented the daemon from reconciling already-finished sessions before the process got stuck.

Desired Outcome

  • Vigilante should detect when a supposedly running or stale session no longer matches GitHub reality and recover it automatically.
  • Recovery should reconcile session state by re-checking the GitHub issue and any associated PR before continuing to present the session as running.
  • vigilante status should stop reporting false stale-running sessions when the remote issue/PR state clearly indicates the work finished, moved to PR maintenance, was closed, or otherwise no longer belongs in running.
  • Manual cleanup should remain available, but it should no longer be required for this recoverable state-drift scenario.
  • Do not broaden this issue into redesigning the whole scheduler or provider lifecycle; keep the fix focused on stale session reconciliation.

Implementation Notes

  • Treat this as a bug in stale-session recovery, not as a documentation-only problem.
  • When a session is considered stale, reconcile it against GitHub before reporting or persisting it as still running:
    • fetch the issue state and labels
    • fetch the PR state if the session has a PR number or if one can be resolved from the session branch
    • use that remote state to determine whether the session should transition to success, closed, PR-maintenance tracking, or another non-running state
  • Use existing session evidence when available (known PR number, branch, issue labels, and any per-issue session log signals), but let GitHub issue/PR state be the deciding factor when local state is obviously stale.
  • Required: the stale-session path must re-check GitHub issue and PR status before continuing to report the session as running.
  • Flexible: whether reconciliation happens during daemon scans, vigilante status, stale-session detection, daemon startup recovery, or a shared recovery helper used by all of those paths.
  • Preserve safe behavior when GitHub is temporarily unavailable: do not silently invent terminal states if remote reconciliation failed.

Acceptance Criteria

  • When a session is marked running locally but the associated GitHub issue/PR state shows the work has already completed or transitioned, Vigilante reconciles the session out of running automatically.
  • vigilante status does not continue to report a stale-running session for the reproduced #299-style scenario once GitHub reconciliation succeeds.
  • Stale-session recovery re-checks GitHub issue status and PR status before leaving a stale session in running.
  • Recovery handles at least these cases coherently: issue closed/done, PR open, PR merged, issue deleted/unavailable, and remote state unavailable due to GitHub/API failure.
  • The recovered session state is persisted back to sessions.json so the same stale warning does not reappear on the next command or scan.
  • Existing manual cleanup flows still work, but are no longer the only way to clear recoverable stale state drift.

Testing Expectations

  • Add or update tests around stale session detection and recovery in the app/status/daemon paths that read sessions.json.
  • Include a regression test that reproduces the observed #299 scenario: local session remains running, per-issue log indicates success, and GitHub issue/PR state indicates completion.
  • Add coverage for the GitHub reconciliation branches, including issue closed with vigilante:done, open PR maintenance, merged PR, and GitHub lookup failure.
  • Validate that recovered state is persisted and that vigilante status output changes from stale-running to the reconciled state.

Operational / UX Considerations

  • Prefer self-healing behavior over operator-only cleanup when the remote GitHub state makes the correct outcome clear.
  • Keep status output trustworthy: if Vigilante says a session is still running, that should mean there is real evidence of active work rather than just stale local JSON.
  • If reconciliation changes a session state automatically, log that transition clearly so operators can understand why the stale warning disappeared.

Labels

vigilante:done: Vigilante completed its work on the issue and no further automation is expected.
