Skip to content

fix(pi-embedded): add aggregate timeout to compaction retry + harden SIGUSR1 restart#17445

Closed
joeykrug wants to merge 4 commits intoopenclaw:mainfrom
joeykrug:fix/compaction-lane-blocking
Closed

fix(pi-embedded): add aggregate timeout to compaction retry + harden SIGUSR1 restart#17445
joeykrug wants to merge 4 commits intoopenclaw:mainfrom
joeykrug:fix/compaction-lane-blocking

Conversation

@joeykrug
Copy link
Copy Markdown
Contributor

@joeykrug joeykrug commented Feb 15, 2026

Fixes #17444

Summary

This PR addresses two related "session looks hung" failure modes in the PI embedded runner:

  1. Compaction retry waits can block the session lane with no aggregate timeout.
  2. SIGUSR1 in-process restart can proceed while embedded compaction is still in-flight, leaving session write locks held by the previous lifecycle.

What changed

1) Bound compaction retry waits (aggregate timeout)

  • Add waitForCompactionRetryWithAggregateTimeout() in run/compaction-retry-aggregate-timeout.ts.
  • In run/attempt.ts, replace the unbounded await abortable(waitForCompactionRetry()) with a 60s aggregate timeout.
  • On aggregate timeout:
    • set timedOutDuringCompaction = true
    • log a warning and proceed using the existing pre-compaction snapshot selection logic
  • Defensive normalization: non-finite aggregateTimeoutMs values fall back to 1ms rather than producing NaN timeouts.
  • Timer cleanup uses timer !== undefined (not truthiness) to handle fake-timer environments where handle can be 0.

This prevents long compaction retry waits from freezing the session lane.

2) Harden SIGUSR1 restart against in-flight embedded runs/compaction

  • Increase default SIGUSR1 deferral budget: DEFAULT_DEFERRAL_MAX_WAIT_MS 30s → 90s.
  • Increase gateway run-loop drain budget: DRAIN_TIMEOUT_MS 30s → 90s.
  • On SIGUSR1 restart in the gateway run loop:
    1. Abort compacting embedded runs (best-effort, surgical — only runs where handle.isCompacting() is true)
    2. Drain both active command-queue tasks and active embedded runs (up to 90s)
    3. If drain times out, abort all embedded runs (best-effort) and proceed

3) Exception-safe abort handling

  • All three abortEmbeddedPiRun() code paths (single session, compacting-only, abort-all) now wrap handle.abort() in try/catch.
  • A thrown abort handler no longer prevents server.close() from executing during SIGUSR1 restart drain.
  • Failed aborts are logged at warn level with the session ID and error.

Why this is safe

  • The aggregate timeout only affects cases where compaction retry waits would otherwise stall; normal compaction runs are unchanged.
  • Restart already implies disruption; aborting compacting runs is a targeted best-effort to ensure session locks are released.
  • Exception guards on abort are strictly additive — they only prevent cascading failures.
  • Changes are localized to restart/compaction wait paths.

Tests

New:

  • src/agents/pi-embedded-runner/run/compaction-retry-aggregate-timeout.e2e.test.ts — aggregate timeout helper
  • src/agents/pi-embedded-runner/runs.e2e.test.ts — surgical abort + waitForActiveEmbeddedRuns

Updated:

  • src/cli/gateway-cli/run-loop.test.ts — drain timeout constant
  • src/infra/infra-runtime.test.ts — deferral timeout constant

Note: e2e tests require -c vitest.e2e.config.ts (excluded from default config).

pnpm -s lint
pnpm -s format:check
npx vitest run src/agents/pi-embedded-runner/ src/cli/gateway-cli/run-loop.test.ts src/infra/infra-runtime.test.ts --reporter=verbose
npx vitest run -c vitest.e2e.config.ts src/agents/pi-embedded-runner/runs.e2e.test.ts src/agents/pi-embedded-runner/run/compaction-retry-aggregate-timeout.e2e.test.ts --reporter=verbose

Notes / follow-ups

  • This PR does not change per-attempt compaction safety timeouts (EMBEDDED_COMPACTION_TIMEOUT_MS); if desired, we can follow up with a smaller per-attempt ceiling and/or a compaction re-entry cooldown once we have more production data.
  • Refactor commits (dedupe compaction failure, dedupe assistant delta parsing, reuse isPidAlive) are included to reduce code duplication touched during the investigation.

Greptile Summary

This PR fixes two related "session hung" failure modes in the PI embedded runner:

  • Aggregate timeout for compaction retry waits: Adds a 60s aggregate timeout to waitForCompactionRetry() in the embedded run attempt loop, preventing unbounded blocking of the session lane. On timeout, the run proceeds with the pre-compaction snapshot. The new waitForCompactionRetryWithAggregateTimeout helper uses Promise.race with proper timer cleanup and defensive normalization for non-finite timeout values.
  • Hardened SIGUSR1 restart against in-flight compaction: The gateway run loop now surgically aborts compacting embedded runs before draining, waits for both active tasks and embedded runs in parallel (up to 90s), and falls back to aborting all runs if the drain times out. The deferral budget and drain timeout are increased from 30s to 90s to match observed production compaction durations.
  • Exception-safe abort handling: All abortEmbeddedPiRun code paths now wrap handle.abort() in try/catch, preventing a thrown abort handler from cascading into the shutdown sequence.

Changes are well-localized to restart/compaction paths, and existing behavior for normal compaction runs is unchanged. Test coverage is thorough with new e2e tests for the aggregate timeout helper and surgical abort modes.

Confidence Score: 5/5

  • This PR is safe to merge — changes are defensive, well-scoped to restart/compaction paths, and thoroughly tested.
  • The changes are strictly additive safety improvements: an aggregate timeout that only fires when compaction would otherwise stall indefinitely, exception guards on abort that only prevent cascading failures, and targeted abort of compacting runs during restart. Normal compaction flows are unaffected. All new behavior paths have dedicated e2e tests, existing tests are updated to match the new timeout constants, and the code follows established patterns in the codebase (polling drain, try/catch abort, Promise.race with cleanup). No logical or security issues found after thorough analysis of all 9 changed files and their related imports.
  • No files require special attention

Last reviewed commit: 6d7ef5a

@openclaw-barnacle openclaw-barnacle bot added cli CLI command changes agents Agent runtime and tooling size: M labels Feb 15, 2026
@joeykrug joeykrug marked this pull request as ready for review February 15, 2026 21:24
Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

9 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

@joeykrug
Copy link
Copy Markdown
Contributor Author

joeykrug commented Feb 15, 2026

Note on timeout constant changes (DEFAULT_DEFERRAL_MAX_WAIT_MS and DRAIN_TIMEOUT_MS: 30s → 90s)

This PR bumps two restart-related timeouts from 30s to 90s:

  • DEFAULT_DEFERRAL_MAX_WAIT_MS in restart.ts — how long the SIGUSR1 handler defers before force-proceeding with restart
  • DRAIN_TIMEOUT_MS in run-loop.ts — how long the gateway waits for active tasks/runs to drain during restart

These can be reverted to 30s without breaking the core fixes. The three actual bug fixes are independent of these constants:

  1. The 60s aggregate timeout on waitForCompactionRetry() (the main fix) doesn't reference either constant
  2. The surgical abort of compacting runs fires regardless of drain timeout length
  3. The exception-safe abort (try/catch on handle.abort()) is purely additive

Why we think they should stay at 90s:

These timeouts only matter during SIGUSR1 restart, not normal operation. The scenario: if compaction is in-flight when restart fires, the gateway needs to either let it finish or surgically abort it. At 30s, Opus compaction (which can take 60-90s on large contexts) will almost always get aborted mid-flight. At 90s, it has a chance to complete cleanly first.

That said, aborting mid-compaction isn't actually dangerous — the PI SDK's session write lock is tied to the run lifecycle, so aborting the run releases the lock. Filesystem writes during compaction use atomic rename. The 90s is a "be polite during shutdown" buffer, not a correctness requirement.

If the formal conformance check is flagging these constants, reverting both to 30s is safe. The worst case is slightly more aggressive abort behavior during restarts (hitting the Phase 3 force-abort-all fallback more often), which is noisy (and less than ideal) but functionally correct.

@openclaw-barnacle
Copy link
Copy Markdown

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

@openclaw-barnacle openclaw-barnacle bot added stale Marked as stale due to inactivity and removed stale Marked as stale due to inactivity labels Feb 22, 2026
@joeykrug joeykrug force-pushed the fix/compaction-lane-blocking branch from 6d7ef5a to 043c1c3 Compare March 7, 2026 02:41
@joeykrug joeykrug force-pushed the fix/compaction-lane-blocking branch from 043c1c3 to 0fc2dd8 Compare March 7, 2026 05:31
…nt mode ignore

Greptile review feedback: the second overload allowed passing both a
sessionId string and opts.mode, but the mode would be silently ignored
at runtime (single-session branch takes priority). Narrow the overloads
so TypeScript rejects this misuse:

- abortEmbeddedPiRun(sessionId: string): single session abort
- abortEmbeddedPiRun(undefined, opts): bulk abort with mode
- abortEmbeddedPiRun(): no-arg abort-all
@cgdusek
Copy link
Copy Markdown
Contributor

cgdusek commented Mar 8, 2026

Thanks for the solid root-cause write-up and patch direction here.

I opened a clean, focused PR on top of current main carrying the same primary safeguards, mainly to avoid unrelated churn and reduce merge friction:
#40324

If preferred, I can rework mine to align with this branch’s exact structure or help port any additional pieces.

@joeykrug
Copy link
Copy Markdown
Contributor Author

Closing as this was fixed in #40324

@joeykrug joeykrug closed this Mar 13, 2026
@joeykrug joeykrug deleted the fix/compaction-lane-blocking branch March 13, 2026 02:01
@joeykrug joeykrug restored the fix/compaction-lane-blocking branch March 13, 2026 02:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling cli CLI command changes size: XL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(pi-embedded): compaction retry blocks session lane + restart collision

3 participants