fix(pi-embedded): compaction retry blocks session lane + restart collision #17444
Summary
Two related reliability issues in the PI embedded runner can make a session (or, in some cases, the gateway) appear “hung”:
- Compaction retry wait blocks the session lane with no aggregate timeout. When auto-compaction retries, the embedded run awaits `waitForCompactionRetry()` while still holding the per-session lane. In the worst case this can block the session for ~15 minutes.
- SIGUSR1 in-process restart collides with in-flight compaction. Restart deferral and drain timeouts are 30s, but embedded compaction commonly takes 60–90s. The gateway restarts while a compaction run still holds the session write lock, causing new work to queue behind an unreleased lock.
Both issues present as: “messages are accepted but no new replies appear for minutes.”
Incidents (production)
Incident A — Compaction retry blocks lane (2026-02-14 ~04:44–04:47 UTC)
- Session context reached 92%+, auto-compaction triggered.
- Compaction itself succeeded quickly (~7s).
- Tool-heavy execution pushed context back above the threshold in the same run, triggering a second compaction.
- The second compaction entered a retry path; the run awaited `waitForCompactionRetry()` with no aggregate timeout, blocking the session lane.
Incident B — SIGUSR1 restart collides with compaction (2026-02-15 ~06:55–06:58 UTC)
- A config patch scheduled/triggered a SIGUSR1 in-process restart.
- An embedded run was already in a compaction phase (observed compaction duration 60–90s).
- Restart deferral/drain budgets were 30s, so the gateway proceeded with restart while compaction was still running.
- After restart, new messages queued behind a session write lock held by the previous lifecycle.
Expected vs actual
Expected
- Auto-compaction should not be able to block a session lane indefinitely.
- SIGUSR1 restart should either (a) wait for compaction to complete, or (b) abort compaction/runs so the next lifecycle can proceed cleanly.
Actual
- `waitForCompactionRetry()` can be awaited while holding the session lane, with no upper bound.
- SIGUSR1 restart can proceed while embedded runs are still active (especially during compaction), leaving behind locks/state that block new work.
Root cause analysis
Bug 1 — Compaction retry wait blocks session lane (3 compounding issues)
1. Lane is held while waiting
- The embedded run awaits compaction completion/retries inside the per-session lane.
2. No aggregate timeout on retry wait
- `waitForCompactionRetry()` resolves only when `pendingCompactionRetry === 0 && !compactionInFlight`.
- There is no "total budget" for waiting across retries.
3. Mid-run compaction re-trigger is possible
- Even when the first compaction is fast, tool outputs can re-expand the context above the threshold and trigger compaction again in the same run.
Worst case today
- Per-attempt compaction safety timeout: 300s
- Max attempts: 3
- Total blocking: ~15 minutes
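The unbounded wait can be sketched roughly as follows. The flag names and resolve condition come from the report above; the polling implementation shape is an assumption for illustration:

```typescript
// Hypothetical sketch of the unbounded wait. There is no total budget:
// the promise resolves only once no retry is pending and no compaction
// is in flight, however long that takes.
let pendingCompactionRetry = 0;
let compactionInFlight = false;

function waitForCompactionRetry(pollMs = 50): Promise<void> {
  return new Promise((resolve) => {
    const poll = setInterval(() => {
      if (pendingCompactionRetry === 0 && !compactionInFlight) {
        clearInterval(poll);
        resolve();
      }
    }, pollMs);
  });
}

// Worst case today: 3 attempts, each bounded only by the 300s
// per-attempt compaction safety timeout.
const PER_ATTEMPT_TIMEOUT_MS = 300_000;
const MAX_ATTEMPTS = 3;
const worstCaseMs = PER_ATTEMPT_TIMEOUT_MS * MAX_ATTEMPTS;
console.log(worstCaseMs / 60_000); // prints 15 (minutes)
```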
Bug 2 — SIGUSR1 restart collides with in-flight compaction (3 compounding issues)
1. Restart deferral max wait is 30s (`src/infra/restart.ts`)
- The restart scheduler emits SIGUSR1 after ~30s even if `getPendingCount()` remains > 0.
2. Restart drain timeout is 30s (`src/cli/gateway-cli/run-loop.ts`)
- After SIGUSR1 is received, the run loop drains active tasks for 30s, then restarts anyway.
3. In-process restart resets lane state without aborting embedded runs
- Lanes are reset for the new iteration, but embedded runs from the old lifecycle may still be active and holding the session write lock.
Code flow (where the blocking happens)
Bug 1: session lane blocking
```
runEmbeddedPiAgent()
  -> enqueueCommandInLane(sessionLane)
  -> ...
  -> session.prompt(...)
  -> await waitForCompactionRetry()   <-- blocks while holding session lane
  -> return result / unlock lane
```
Bug 2: restart collision
```
(config watcher) schedule SIGUSR1
  -> deferGatewayRestartUntilIdle(maxWait=30s)
  -> SIGUSR1 emitted even if compaction still pending

(gateway run loop) on SIGUSR1
  -> waitForActiveTasks(timeout=30s)
  -> restart iteration + resetAllLanes()
  -> old embedded run still holds session write lock
  -> new work queues behind unreleased lock
```
Proposed fix
Fix 1 — Add an aggregate timeout around compaction retry wait
In the embedded run attempt (after prompt), wrap the wait:
- Add `COMPACTION_RETRY_AGGREGATE_TIMEOUT_MS = 60_000` and wrap the wait in `Promise.race([waitForCompactionRetry(), timeout])`.
- On timeout: log a warning and proceed using the pre-compaction snapshot (`timedOutDuringCompaction = true`).

This bounds "lane blocked by compaction retry wait" to ≤ 60s.
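Fix 1 could look roughly like this. `COMPACTION_RETRY_AGGREGATE_TIMEOUT_MS`, `waitForCompactionRetry`, and `timedOutDuringCompaction` are from the proposal; the wrapper function and its shape are illustrative:

```typescript
// Sketch: bound the compaction retry wait with an aggregate timeout via
// Promise.race, and clear the timer on both paths so nothing leaks.
const COMPACTION_RETRY_AGGREGATE_TIMEOUT_MS = 60_000;

async function waitWithAggregateTimeout(
  waitForCompactionRetry: () => Promise<void>,
  timeoutMs = COMPACTION_RETRY_AGGREGATE_TIMEOUT_MS,
): Promise<{ timedOutDuringCompaction: boolean }> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<"timeout">((resolve) => {
    timer = setTimeout(() => resolve("timeout"), timeoutMs);
  });
  try {
    const winner = await Promise.race([
      waitForCompactionRetry().then(() => "done" as const),
      timeout,
    ]);
    if (winner === "timeout") {
      console.warn("compaction retry wait exceeded aggregate timeout; using pre-compaction snapshot");
      return { timedOutDuringCompaction: true };
    }
    return { timedOutDuringCompaction: false };
  } finally {
    clearTimeout(timer); // no leaked timeouts (see test plan)
  }
}
```

The `finally` is what keeps the "timer cleanup" bullet in the test plan honest: the timeout is cleared whether the wait resolves or the race times out.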
Fix 2 — Harden SIGUSR1 restart around in-flight embedded compaction
- Increase restart deferral max wait: 30s → 90s (`DEFAULT_DEFERRAL_MAX_WAIT_MS`)
- Increase run-loop drain timeout: 30s → 90s (`DRAIN_TIMEOUT_MS`)
- On SIGUSR1 restart:
  - Abort compacting embedded runs (best-effort)
  - Drain both active tasks and active embedded runs (up to 90s)
  - If drain times out, abort all embedded runs (best-effort) and proceed with restart
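The restart-path half of Fix 2 could be sketched as follows. Only `DRAIN_TIMEOUT_MS` is from the proposal; the `EmbeddedRun` interface and drain helper are assumed shapes for illustration:

```typescript
// Sketch: abort compacting runs up front, drain tasks + embedded runs for
// up to 90s, then best-effort abort whatever remains before restarting.
const DRAIN_TIMEOUT_MS = 90_000;

interface EmbeddedRun {
  done: Promise<void>;
  abort(): void; // best-effort; expected to release the session write lock
  isCompacting(): boolean;
}

async function drainForRestart(
  activeTasks: Promise<void>[],
  embeddedRuns: EmbeddedRun[],
  timeoutMs = DRAIN_TIMEOUT_MS,
): Promise<void> {
  // Proactively abort compacting runs rather than waiting out 60-90s.
  for (const run of embeddedRuns) {
    if (run.isCompacting()) run.abort();
  }
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<"timeout">((resolve) => {
    timer = setTimeout(() => resolve("timeout"), timeoutMs);
  });
  const drained = Promise.allSettled([
    ...activeTasks,
    ...embeddedRuns.map((run) => run.done),
  ]).then(() => "drained" as const);
  const winner = await Promise.race([drained, timeout]);
  clearTimeout(timer);
  if (winner === "timeout") {
    // Last resort: make sure no run from this lifecycle survives into the
    // next iteration holding the session write lock.
    for (const run of embeddedRuns) run.abort();
  }
}
```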
Worst-case analysis (after fix)
- Bug 1: session lane blocking from compaction retry wait drops from ~15 minutes → 60 seconds.
- Bug 2: the restart deferral/drain budgets now cover the observed compaction duration (60–90s), and compacting runs are proactively aborted, shrinking the collision window.
Test plan
Add/extend tests to cover:
- Aggregate timeout fires when compaction retry wait exceeds 60s.
- Timer cleanup (no leaked timeouts).
- SIGUSR1 restart path aborts compacting embedded runs.
- Run-loop drain waits for embedded runs, not only tasks.
- Deferral timeout updated to 90s in `deferGatewayRestartUntilIdle`.
Follow-ups (optional)
- Consider reducing `EMBEDDED_COMPACTION_TIMEOUT_MS` (currently 300s) if compaction is consistently fast in practice.
- Consider a short compaction re-entry cooldown to avoid immediate re-trigger within the same tool-heavy run.