fix(pi-embedded): add aggregate timeout to compaction retry + harden SIGUSR1 restart by joeykrug · Pull Request #17445 · openclaw/openclaw

joeykrug · 2026-02-15T19:24:02Z

Fixes #17444

Summary

This PR addresses two related "session looks hung" failure modes in the PI embedded runner:

Compaction retry waits can block the session lane with no aggregate timeout.
SIGUSR1 in-process restart can proceed while embedded compaction is still in-flight, leaving session write locks held by the previous lifecycle.

What changed

1) Bound compaction retry waits (aggregate timeout)

Add waitForCompactionRetryWithAggregateTimeout() in run/compaction-retry-aggregate-timeout.ts.
In run/attempt.ts, replace the unbounded await abortable(waitForCompactionRetry()) with a 60s aggregate timeout.
On aggregate timeout:
- set timedOutDuringCompaction = true
- log a warning and proceed using the existing pre-compaction snapshot selection logic
Defensive normalization: non-finite aggregateTimeoutMs values fall back to 1ms rather than producing NaN timeouts.
Timer cleanup uses timer !== undefined (not truthiness) to handle fake-timer environments where handle can be 0.

This prevents long compaction retry waits from freezing the session lane.

2) Harden SIGUSR1 restart against in-flight embedded runs/compaction

Increase default SIGUSR1 deferral budget: DEFAULT_DEFERRAL_MAX_WAIT_MS 30s → 90s.
Increase gateway run-loop drain budget: DRAIN_TIMEOUT_MS 30s → 90s.
On SIGUSR1 restart in the gateway run loop:
1. Abort compacting embedded runs (best-effort, surgical — only runs where handle.isCompacting() is true)
2. Drain both active command-queue tasks and active embedded runs (up to 90s)
3. If drain times out, abort all embedded runs (best-effort) and proceed

3) Exception-safe abort handling

All three abortEmbeddedPiRun() code paths (single session, compacting-only, abort-all) now wrap handle.abort() in try/catch.
A thrown abort handler no longer prevents server.close() from executing during SIGUSR1 restart drain.
Failed aborts are logged at warn level with the session ID and error.

Why this is safe

The aggregate timeout only affects cases where compaction retry waits would otherwise stall; normal compaction runs are unchanged.
Restart already implies disruption; aborting compacting runs is a targeted best-effort to ensure session locks are released.
Exception guards on abort are strictly additive — they only prevent cascading failures.
Changes are localized to restart/compaction wait paths.

Tests

New:

src/agents/pi-embedded-runner/run/compaction-retry-aggregate-timeout.e2e.test.ts — aggregate timeout helper
src/agents/pi-embedded-runner/runs.e2e.test.ts — surgical abort + waitForActiveEmbeddedRuns

Updated:

src/cli/gateway-cli/run-loop.test.ts — drain timeout constant
src/infra/infra-runtime.test.ts — deferral timeout constant

Note: e2e tests require -c vitest.e2e.config.ts (excluded from default config).

pnpm -s lint
pnpm -s format:check
npx vitest run src/agents/pi-embedded-runner/ src/cli/gateway-cli/run-loop.test.ts src/infra/infra-runtime.test.ts --reporter=verbose
npx vitest run -c vitest.e2e.config.ts src/agents/pi-embedded-runner/runs.e2e.test.ts src/agents/pi-embedded-runner/run/compaction-retry-aggregate-timeout.e2e.test.ts --reporter=verbose

Notes / follow-ups

This PR does not change per-attempt compaction safety timeouts (EMBEDDED_COMPACTION_TIMEOUT_MS); if desired, we can follow up with a smaller per-attempt ceiling and/or a compaction re-entry cooldown once we have more production data.
Refactor commits (dedupe compaction failure, dedupe assistant delta parsing, reuse isPidAlive) are included to reduce code duplication touched during the investigation.

Greptile Summary

This PR fixes two related "session hung" failure modes in the PI embedded runner:

Aggregate timeout for compaction retry waits: Adds a 60s aggregate timeout to waitForCompactionRetry() in the embedded run attempt loop, preventing unbounded blocking of the session lane. On timeout, the run proceeds with the pre-compaction snapshot. The new waitForCompactionRetryWithAggregateTimeout helper uses Promise.race with proper timer cleanup and defensive normalization for non-finite timeout values.
Hardened SIGUSR1 restart against in-flight compaction: The gateway run loop now surgically aborts compacting embedded runs before draining, waits for both active tasks and embedded runs in parallel (up to 90s), and falls back to aborting all runs if the drain times out. The deferral budget and drain timeout are increased from 30s to 90s to match observed production compaction durations.
Exception-safe abort handling: All abortEmbeddedPiRun code paths now wrap handle.abort() in try/catch, preventing a thrown abort handler from cascading into the shutdown sequence.

Changes are well-localized to restart/compaction paths, and existing behavior for normal compaction runs is unchanged. Test coverage is thorough with new e2e tests for the aggregate timeout helper and surgical abort modes.

Confidence Score: 5/5

This PR is safe to merge — changes are defensive, well-scoped to restart/compaction paths, and thoroughly tested.
The changes are strictly additive safety improvements: an aggregate timeout that only fires when compaction would otherwise stall indefinitely, exception guards on abort that only prevent cascading failures, and targeted abort of compacting runs during restart. Normal compaction flows are unaffected. All new behavior paths have dedicated e2e tests, existing tests are updated to match the new timeout constants, and the code follows established patterns in the codebase (polling drain, try/catch abort, Promise.race with cleanup). No logical or security issues found after thorough analysis of all 9 changed files and their related imports.
No files require special attention

_{Last reviewed commit: 6d7ef5a}

greptile-apps

_{9 files reviewed, 1 comment}

_{Edit Code Review Agent Settings | Greptile}

src/agents/pi-embedded-runner/runs.ts

joeykrug · 2026-02-15T21:43:57Z

Note on timeout constant changes (`DEFAULT_DEFERRAL_MAX_WAIT_MS` and `DRAIN_TIMEOUT_MS`: 30s → 90s)

This PR bumps two restart-related timeouts from 30s to 90s:

DEFAULT_DEFERRAL_MAX_WAIT_MS in restart.ts — how long the SIGUSR1 handler defers before force-proceeding with restart
DRAIN_TIMEOUT_MS in run-loop.ts — how long the gateway waits for active tasks/runs to drain during restart

These can be reverted to 30s without breaking the core fixes. The three actual bug fixes are independent of these constants:

The 60s aggregate timeout on waitForCompactionRetry() (the main fix) doesn't reference either constant
The surgical abort of compacting runs fires regardless of drain timeout length
The exception-safe abort (try/catch on handle.abort()) is purely additive

Why we think they should stay at 90s:

These timeouts only matter during SIGUSR1 restart, not normal operation. The scenario: if compaction is in-flight when restart fires, the gateway needs to either let it finish or surgically abort it. At 30s, Opus compaction (which can take 60-90s on large contexts) will almost always get aborted mid-flight. At 90s, it has a chance to complete cleanly first.

That said, aborting mid-compaction isn't actually dangerous — the PI SDK's session write lock is tied to the run lifecycle, so aborting the run releases the lock. Filesystem writes during compaction use atomic rename. The 90s is a "be polite during shutdown" buffer, not a correctness requirement.

If the formal conformance check is flagging these constants, reverting both to 30s is safe. The worst case is slightly more aggressive abort behavior during restarts (hitting the Phase 3 force-abort-all fallback more often), which is noisy (and less than ideal) but functionally correct.

openclaw-barnacle · 2026-02-22T04:07:17Z

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

…imeout cleanup

…nt mode ignore Greptile review feedback: the second overload allowed passing both a sessionId string and opts.mode, but the mode would be silently ignored at runtime (single-session branch takes priority). Narrow the overloads so TypeScript rejects this misuse: - abortEmbeddedPiRun(sessionId: string): single session abort - abortEmbeddedPiRun(undefined, opts): bulk abort with mode - abortEmbeddedPiRun(): no-arg abort-all

cgdusek · 2026-03-08T22:45:10Z

Thanks for the solid root-cause write-up and patch direction here.

I opened a clean, focused PR on top of current main carrying the same primary safeguards, mainly to avoid unrelated churn and reduce merge friction:
#40324

If preferred, I can rework mine to align with this branch’s exact structure or help port any additional pieces.

joeykrug · 2026-03-13T02:00:54Z

Closing as this was fixed in #40324

openclaw-barnacle bot added cli CLI command changes agents Agent runtime and tooling size: M labels Feb 15, 2026

joeykrug marked this pull request as ready for review February 15, 2026 21:24

greptile-apps bot reviewed Feb 15, 2026

View reviewed changes

src/agents/pi-embedded-runner/runs.ts Show resolved Hide resolved

joeykrug marked this pull request as draft February 15, 2026 21:47

joeykrug marked this pull request as ready for review February 15, 2026 21:47

amabito mentioned this pull request Feb 16, 2026

fix: contain abort retries during gateway restart (#17589) #17940

Closed

4 tasks

steipete closed this Feb 16, 2026

steipete reopened this Feb 17, 2026

okuyam2y mentioned this pull request Feb 20, 2026

fix(pi-embedded): compaction retry blocks session lane + restart collision #17444

Closed

openclaw-barnacle bot added stale Marked as stale due to inactivity and removed stale Marked as stale due to inactivity labels Feb 22, 2026

joeykrug force-pushed the fix/compaction-lane-blocking branch from 6d7ef5a to 043c1c3 Compare March 7, 2026 02:41

akz142857 mentioned this pull request Mar 7, 2026

fix(agents): prevent session freeze on compaction timeout #38585

Closed

7 tasks

joeykrug added 3 commits March 7, 2026 01:25

fix(pi-embedded): bound compaction retry wait + harden SIGUSR1 restart

1bb645c

fix(pi-embedded): address review feedback — harden abort handling + t…

7e2e734

…imeout cleanup

ci: retrigger checks (windows gateway-lock flaky on CI)

5e56f72

joeykrug force-pushed the fix/compaction-lane-blocking branch from 043c1c3 to 0fc2dd8 Compare March 7, 2026 05:31

openclaw-barnacle bot added size: XL and removed size: M labels Mar 7, 2026

joeykrug force-pushed the fix/compaction-lane-blocking branch from 0fc2dd8 to 0b77438 Compare March 7, 2026 05:40

cgdusek mentioned this pull request Mar 8, 2026

fix(agents): bound compaction retry wait and drain embedded runs on restart #40324

Merged

joeykrug closed this Mar 13, 2026

joeykrug deleted the fix/compaction-lane-blocking branch March 13, 2026 02:01

joeykrug restored the fix/compaction-lane-blocking branch March 13, 2026 02:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(pi-embedded): add aggregate timeout to compaction retry + harden SIGUSR1 restart#17445

fix(pi-embedded): add aggregate timeout to compaction retry + harden SIGUSR1 restart#17445
joeykrug wants to merge 4 commits intoopenclaw:mainfrom
joeykrug:fix/compaction-lane-blocking

joeykrug commented Feb 15, 2026 •

edited by greptile-apps bot

Loading

Uh oh!

greptile-apps bot left a comment

Uh oh!

Uh oh!

joeykrug commented Feb 15, 2026 •

edited

Loading

Uh oh!

openclaw-barnacle bot commented Feb 22, 2026

Uh oh!

cgdusek commented Mar 8, 2026

Uh oh!

joeykrug commented Mar 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

joeykrug commented Feb 15, 2026 • edited by greptile-apps bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What changed

1) Bound compaction retry waits (aggregate timeout)

2) Harden SIGUSR1 restart against in-flight embedded runs/compaction

3) Exception-safe abort handling

Why this is safe

Tests

Notes / follow-ups

Greptile Summary

Confidence Score: 5/5

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

joeykrug commented Feb 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Note on timeout constant changes (DEFAULT_DEFERRAL_MAX_WAIT_MS and DRAIN_TIMEOUT_MS: 30s → 90s)

Uh oh!

openclaw-barnacle bot commented Feb 22, 2026

Uh oh!

cgdusek commented Mar 8, 2026

Uh oh!

joeykrug commented Mar 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

joeykrug commented Feb 15, 2026 •

edited by greptile-apps bot

Loading

joeykrug commented Feb 15, 2026 •

edited

Loading

Note on timeout constant changes (`DEFAULT_DEFERRAL_MAX_WAIT_MS` and `DRAIN_TIMEOUT_MS`: 30s → 90s)