fix(pi-embedded): compaction retry blocks session lane + restart collision #17444
Summary
Two related reliability issues in the PI embedded runner can make a session (or, in some cases, the gateway) appear “hung”:
- Compaction retry wait blocks the session lane with no aggregate timeout. When auto-compaction retries, the embedded run awaits `waitForCompactionRetry()` while still holding the per-session lane. In the worst case this can block the session for ~15 minutes.
- SIGUSR1 in-process restart collides with in-flight compaction. Restart deferral and drain timeouts are 30s, but embedded compaction commonly takes 60–90s. The gateway restarts while a compaction run still holds the session write lock, causing new work to queue behind an unreleased lock.
Both issues present as: “messages are accepted but no new replies appear for minutes.”
Incidents (production)
Incident A — Compaction retry blocks lane (2026-02-14 ~04:44–04:47 UTC)
- Session context reached 92%+, auto-compaction triggered.
- Compaction itself succeeded quickly (~7s).
- Tool-heavy execution pushed context back above the threshold in the same run, triggering a second compaction.
- The second compaction entered a retry path; the run awaited `waitForCompactionRetry()` with no aggregate timeout, blocking the session lane.
Incident B — SIGUSR1 restart collides with compaction (2026-02-15 ~06:55–06:58 UTC)
- A config patch scheduled/triggered a SIGUSR1 in-process restart.
- An embedded run was already in a compaction phase (observed compaction duration 60–90s).
- Restart deferral/drain budgets were 30s, so the gateway proceeded with restart while compaction was still running.
- After restart, new messages queued behind a session write lock held by the previous lifecycle.
Expected vs actual
Expected
- Auto-compaction should not be able to block a session lane indefinitely.
- SIGUSR1 restart should either (a) wait for compaction to complete, or (b) abort compaction/runs so the next lifecycle can proceed cleanly.
Actual
- `waitForCompactionRetry()` can be awaited while holding the session lane, with no upper bound.
- SIGUSR1 restart can proceed while embedded runs are still active (especially during compaction), leaving behind locks/state that block new work.
Root cause analysis
Bug 1 — Compaction retry wait blocks session lane (3 compounding issues)
1. Lane is held while waiting
- The embedded run awaits compaction completion/retries inside the per-session lane.
2. No aggregate timeout on retry wait
- `waitForCompactionRetry()` resolves only when `pendingCompactionRetry === 0 && !compactionInFlight`.
- There is no "total budget" for waiting across retries.
3. Mid-run compaction re-trigger is possible
- Even when the first compaction is fast, tool outputs can re-expand the context above the threshold and trigger compaction again in the same run.
Worst case today
- Per-attempt compaction safety timeout: 300s
- Max attempts: 3
- Total blocking: ~15 minutes
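The unbounded wait can be sketched roughly as follows. The flag names and resolve condition come from the report above; the polling implementation shape is an assumption for illustration:

```typescript
// Hypothetical sketch of the unbounded wait. There is no total budget:
// the promise resolves only once no retry is pending and no compaction
// is in flight, however long that takes.
let pendingCompactionRetry = 0;
let compactionInFlight = false;

function waitForCompactionRetry(pollMs = 50): Promise<void> {
  return new Promise((resolve) => {
    const poll = setInterval(() => {
      if (pendingCompactionRetry === 0 && !compactionInFlight) {
        clearInterval(poll);
        resolve();
      }
    }, pollMs);
  });
}

// Worst case today: 3 attempts, each bounded only by the 300s
// per-attempt compaction safety timeout.
const PER_ATTEMPT_TIMEOUT_MS = 300_000;
const MAX_ATTEMPTS = 3;
const worstCaseMs = PER_ATTEMPT_TIMEOUT_MS * MAX_ATTEMPTS;
console.log(worstCaseMs / 60_000); // prints 15 (minutes)
```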
Bug 2 — SIGUSR1 restart collides with in-flight compaction (3 compounding issues)
1. Restart deferral max wait is 30s (`src/infra/restart.ts`)
- The restart scheduler emits SIGUSR1 after ~30s even if `getPendingCount()` remains > 0.
2. Restart drain timeout is 30s (`src/cli/gateway-cli/run-loop.ts`)
- After SIGUSR1 is received, the run loop drains active tasks for 30s, then restarts anyway.
3. In-process restart resets lane state without aborting embedded runs
- Lanes are reset for the new iteration, but embedded runs from the old lifecycle may still be active and holding the session write lock.
Code flow (where the blocking happens)
Bug 1: session lane blocking
```
runEmbeddedPiAgent()
  -> enqueueCommandInLane(sessionLane)
  -> ...
  -> session.prompt(...)
  -> await waitForCompactionRetry()   <-- blocks while holding session lane
  -> return result / unlock lane
```
Bug 2: restart collision
```
(config watcher) schedule SIGUSR1
  -> deferGatewayRestartUntilIdle(maxWait=30s)
  -> SIGUSR1 emitted even if compaction still pending

(gateway run loop) on SIGUSR1
  -> waitForActiveTasks(timeout=30s)
  -> restart iteration + resetAllLanes()
  -> old embedded run still holds session write lock
  -> new work queues behind unreleased lock
```
Proposed fix
Fix 1 — Add an aggregate timeout around compaction retry wait
In the embedded run attempt (after prompt), wrap the wait:
- Add `COMPACTION_RETRY_AGGREGATE_TIMEOUT_MS = 60_000` and wrap the wait in `Promise.race([waitForCompactionRetry(), timeout])`.
- On timeout: log a warning and proceed using the pre-compaction snapshot (`timedOutDuringCompaction = true`).

This bounds "lane blocked by compaction retry wait" to ≤ 60s.
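Fix 1 could look roughly like this. `COMPACTION_RETRY_AGGREGATE_TIMEOUT_MS`, `waitForCompactionRetry`, and `timedOutDuringCompaction` are from the proposal; the wrapper function and its shape are illustrative:

```typescript
// Sketch: bound the compaction retry wait with an aggregate timeout via
// Promise.race, and clear the timer on both paths so nothing leaks.
const COMPACTION_RETRY_AGGREGATE_TIMEOUT_MS = 60_000;

async function waitWithAggregateTimeout(
  waitForCompactionRetry: () => Promise<void>,
  timeoutMs = COMPACTION_RETRY_AGGREGATE_TIMEOUT_MS,
): Promise<{ timedOutDuringCompaction: boolean }> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<"timeout">((resolve) => {
    timer = setTimeout(() => resolve("timeout"), timeoutMs);
  });
  try {
    const winner = await Promise.race([
      waitForCompactionRetry().then(() => "done" as const),
      timeout,
    ]);
    if (winner === "timeout") {
      console.warn("compaction retry wait exceeded aggregate timeout; using pre-compaction snapshot");
      return { timedOutDuringCompaction: true };
    }
    return { timedOutDuringCompaction: false };
  } finally {
    clearTimeout(timer); // no leaked timeouts (see test plan)
  }
}
```

The `finally` is what keeps the "timer cleanup" bullet in the test plan honest: the timeout is cleared whether the wait resolves or the race times out.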
Fix 2 — Harden SIGUSR1 restart around in-flight embedded compaction
- Increase restart deferral max wait: 30s → 90s (`DEFAULT_DEFERRAL_MAX_WAIT_MS`)
- Increase run-loop drain timeout: 30s → 90s (`DRAIN_TIMEOUT_MS`)
- On SIGUSR1 restart:
  - Abort compacting embedded runs (best-effort)
  - Drain both active tasks and active embedded runs (up to 90s)
  - If drain times out, abort all embedded runs (best-effort) and proceed with restart
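The restart-path half of Fix 2 could be sketched as follows. Only `DRAIN_TIMEOUT_MS` is from the proposal; the `EmbeddedRun` interface and drain helper are assumed shapes for illustration:

```typescript
// Sketch: abort compacting runs up front, drain tasks + embedded runs for
// up to 90s, then best-effort abort whatever remains before restarting.
const DRAIN_TIMEOUT_MS = 90_000;

interface EmbeddedRun {
  done: Promise<void>;
  abort(): void; // best-effort; expected to release the session write lock
  isCompacting(): boolean;
}

async function drainForRestart(
  activeTasks: Promise<void>[],
  embeddedRuns: EmbeddedRun[],
  timeoutMs = DRAIN_TIMEOUT_MS,
): Promise<void> {
  // Proactively abort compacting runs rather than waiting out 60-90s.
  for (const run of embeddedRuns) {
    if (run.isCompacting()) run.abort();
  }
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<"timeout">((resolve) => {
    timer = setTimeout(() => resolve("timeout"), timeoutMs);
  });
  const drained = Promise.allSettled([
    ...activeTasks,
    ...embeddedRuns.map((run) => run.done),
  ]).then(() => "drained" as const);
  const winner = await Promise.race([drained, timeout]);
  clearTimeout(timer);
  if (winner === "timeout") {
    // Last resort: make sure no run from this lifecycle survives into the
    // next iteration holding the session write lock.
    for (const run of embeddedRuns) run.abort();
  }
}
```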
Worst-case analysis (after fix)
- Bug 1: session lane blocking from compaction retry wait drops from ~15 minutes → 60 seconds.
- Bug 2: the restart deferral/drain budgets now cover the observed compaction duration (60–90s), and compacting runs are proactively aborted, shrinking the collision window.
Test plan
Add/extend tests to cover:
- Aggregate timeout fires when compaction retry wait exceeds 60s.
- Timer cleanup (no leaked timeouts).
- SIGUSR1 restart path aborts compacting embedded runs.
- Run-loop drain waits for embedded runs, not only tasks.
- Deferral timeout updated to 90s in `deferGatewayRestartUntilIdle`.
Follow-ups (optional)
- Consider reducing `EMBEDDED_COMPACTION_TIMEOUT_MS` (currently 300s) if compaction is consistently fast in practice.
- Consider a short compaction re-entry cooldown to avoid immediate re-trigger within the same tool-heavy run.