fix(pi-embedded): add aggregate timeout to compaction retry + harden SIGUSR1 restart#17445
fix(pi-embedded): add aggregate timeout to compaction retry + harden SIGUSR1 restart#17445joeykrug wants to merge 4 commits intoopenclaw:mainfrom
Conversation
Note on timeout constant changes (
|
|
This pull request has been automatically marked as stale due to inactivity. |
6d7ef5a to
043c1c3
Compare
043c1c3 to
0fc2dd8
Compare
…nt mode ignore Greptile review feedback: the second overload allowed passing both a sessionId string and opts.mode, but the mode would be silently ignored at runtime (single-session branch takes priority). Narrow the overloads so TypeScript rejects this misuse: - abortEmbeddedPiRun(sessionId: string): single session abort - abortEmbeddedPiRun(undefined, opts): bulk abort with mode - abortEmbeddedPiRun(): no-arg abort-all
0fc2dd8 to
0b77438
Compare
|
Thanks for the solid root-cause write-up and patch direction here. I opened a clean, focused PR on top of current If preferred, I can rework mine to align with this branch’s exact structure or help port any additional pieces. |
|
Closing as this was fixed in #40324 |
Fixes #17444
Summary
This PR addresses two related "session looks hung" failure modes in the PI embedded runner:
What changed
1) Bound compaction retry waits (aggregate timeout)
waitForCompactionRetryWithAggregateTimeout()inrun/compaction-retry-aggregate-timeout.ts.run/attempt.ts, replace the unboundedawait abortable(waitForCompactionRetry())with a 60s aggregate timeout.timedOutDuringCompaction = trueaggregateTimeoutMsvalues fall back to 1ms rather than producingNaNtimeouts.timer !== undefined(not truthiness) to handle fake-timer environments where handle can be0.This prevents long compaction retry waits from freezing the session lane.
2) Harden SIGUSR1 restart against in-flight embedded runs/compaction
DEFAULT_DEFERRAL_MAX_WAIT_MS30s → 90s.DRAIN_TIMEOUT_MS30s → 90s.handle.isCompacting()is true)3) Exception-safe abort handling
abortEmbeddedPiRun()code paths (single session, compacting-only, abort-all) now wraphandle.abort()in try/catch.server.close()from executing during SIGUSR1 restart drain.warnlevel with the session ID and error.Why this is safe
Tests
New:
src/agents/pi-embedded-runner/run/compaction-retry-aggregate-timeout.e2e.test.ts— aggregate timeout helpersrc/agents/pi-embedded-runner/runs.e2e.test.ts— surgical abort +waitForActiveEmbeddedRunsUpdated:
src/cli/gateway-cli/run-loop.test.ts— drain timeout constantsrc/infra/infra-runtime.test.ts— deferral timeout constantNote: e2e tests require
-c vitest.e2e.config.ts(excluded from default config).Notes / follow-ups
EMBEDDED_COMPACTION_TIMEOUT_MS); if desired, we can follow up with a smaller per-attempt ceiling and/or a compaction re-entry cooldown once we have more production data.dedupe compaction failure,dedupe assistant delta parsing,reuse isPidAlive) are included to reduce code duplication touched during the investigation.Greptile Summary
This PR fixes two related "session hung" failure modes in the PI embedded runner:
waitForCompactionRetry()in the embedded run attempt loop, preventing unbounded blocking of the session lane. On timeout, the run proceeds with the pre-compaction snapshot. The newwaitForCompactionRetryWithAggregateTimeouthelper usesPromise.racewith proper timer cleanup and defensive normalization for non-finite timeout values.abortEmbeddedPiRuncode paths now wraphandle.abort()in try/catch, preventing a thrown abort handler from cascading into the shutdown sequence.Changes are well-localized to restart/compaction paths, and existing behavior for normal compaction runs is unchanged. Test coverage is thorough with new e2e tests for the aggregate timeout helper and surgical abort modes.
Confidence Score: 5/5
Last reviewed commit: 6d7ef5a