fix(gateway): resolve PID mismatch in restart health check #24301
Closed
justinhuangcode wants to merge 3 commits into openclaw:main from
Branch updated: cbc29d8 → f244b6e
Branch updated: f244b6e → f8e4637
The gateway runs as a two-process tree (supervisor → child), but the health check compared the port-listener PID (child) directly against the systemd `MainPID` (supervisor). Because these two PIDs always differ by construction, `ownsPort` was always false and the child was always classified as stale, causing `openclaw gateway restart` to kill the healthy child and then time out.

Walk the process tree via the `PPid` field of `/proc/<pid>/status` so that any descendant of `MainPID` is recognised as owned.

Fixes openclaw#24279

AI-assisted: Yes (Claude). Prompts/session logs available on request.
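A minimal sketch of the `/proc` lookup the commit message describes, assuming Node.js and `node:fs`. The helper names `parsePpid` and `readParentPid` are hypothetical and not taken from the PR; `null` stands for "parent unknown" (no `/proc`, or the process has already exited).

```typescript
import { readFileSync } from "node:fs";

// Hypothetical helper: extract the parent PID from the text of
// /proc/<pid>/status, which contains a line such as "PPid:\t1234".
function parsePpid(statusText: string): number | null {
  const match = /^PPid:\s+(\d+)/m.exec(statusText);
  return match ? Number(match[1]) : null;
}

// Hypothetical helper: read the parent PID of `pid` (Linux only).
// Returns null when /proc is unavailable (macOS, Windows) or the
// process no longer exists, so callers can treat the parent as unknown.
function readParentPid(pid: number): number | null {
  try {
    return parsePpid(readFileSync(`/proc/${pid}/status`, "utf8"));
  } catch {
    return null;
  }
}
```

Anchoring the regex at the start of a line (`^PPid:` with the `m` flag) avoids matching other fields that merely end in `Pid`, such as `TracerPid`.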
Branch updated: f8e4637 → f95efcd
justinhuangcode (author):
Closing: the underlying issue has been resolved by #24696 (merged), which addresses the same PID ownership problem via a different approach. Thank you!
Summary
Fixes #24279
`openclaw gateway restart` always times out because the health check compares the port-listener PID (child process) directly against the systemd `MainPID` (supervisor process). Since the gateway runs as a two-process tree (supervisor → child that binds the port), these PIDs always differ, causing:

- `ownsPort` is always `false`
- → `healthy` is always `false`
- → the restart keeps waiting for `healthy` to be `true`, which never happens

Changes
- `isDescendantOf(childPid, ancestorPid)`: walks the `PPid` field of `/proc/<pid>/status` to check the parent-child relationship
- `ownsPort` check: accept port listeners that are descendants of `MainPID`
- `staleGatewayPids` filter: descendants of `MainPID` are not stale
- … `isDescendantOf`

Fallback behavior
On non-Linux systems (macOS, Windows) where `/proc` is unavailable, `isDescendantOf` returns `false` gracefully, preserving the existing direct-PID comparison behavior.

Testing
- `src/cli/daemon-cli/restart-health.test.ts`: unit tests for `isDescendantOf` (7 cases)
- `lifecycle.test.ts` tests are unaffected (they mock the `restart-health` module)

AI-assisted: Yes (Claude). Prompts/session logs available on request.
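The helper and its fallback behavior can be sketched as follows. This is an illustration under stated assumptions, not the code proposed in the PR: the third `readParentPid` parameter is a hypothetical injection point added so the walk can be exercised without `/proc`, and the default reader is a Linux-only parser of `/proc/<pid>/status`.

```typescript
import { readFileSync } from "node:fs";

// Returns the parent PID of `pid`, or null if it cannot be determined.
type ParentPidReader = (pid: number) => number | null;

// Default reader (Linux only): parse the PPid field of /proc/<pid>/status.
// Any error (no /proc on macOS/Windows, process already exited) yields null.
function readParentPidFromProc(pid: number): number | null {
  try {
    const status = readFileSync(`/proc/${pid}/status`, "utf8");
    const match = /^PPid:\s+(\d+)/m.exec(status);
    return match ? Number(match[1]) : null;
  } catch {
    return null;
  }
}

// True if `childPid` is a strict descendant of `ancestorPid`.
// The third parameter is a hypothetical injection point for tests.
function isDescendantOf(
  childPid: number,
  ancestorPid: number,
  readParentPid: ParentPidReader = readParentPidFromProc,
): boolean {
  if (childPid === ancestorPid) return false; // equality is the caller's direct comparison
  const visited = new Set<number>(); // guards against cycles in inconsistent data
  let current = childPid;
  // Stop once the walk reaches init (PID 1) or an invalid PID.
  while (current > 1 && !visited.has(current)) {
    visited.add(current);
    const parent = readParentPid(current);
    if (parent === null) return false; // unknown parent: report "not a descendant"
    if (parent === ancestorPid) return true;
    current = parent;
  }
  return false;
}
```

For example, with a fake parent table `{300 → 200, 200 → 100, 100 → 1}`, `isDescendantOf(300, 100, reader)` is `true`, while `isDescendantOf(100, 300, reader)` is `false` because the walk stops at PID 1. Injecting the reader keeps such tests platform-independent; with the default reader on macOS or Windows every lookup returns `null`, so the function returns `false` and the caller's direct comparison decides.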
Greptile Summary
Fixed the `openclaw gateway restart` timeout by adding process-tree awareness to health checks. The gateway runs as supervisor → child process, where the child binds the port. Previously, health checks compared the port-listener PID directly against the supervisor PID, causing a perpetual "unhealthy" status.

Changes:
- New `isDescendantOf()` walks `/proc/<pid>/status` to check parent-child relationships
- `ownsPort` check accepts descendants of `MainPID` as valid owners
- `staleGatewayPids` filter excludes descendants of `MainPID`

Confidence Score: 5/5
The `isDescendantOf` function has proper safeguards (a visited set to prevent loops, a PID <= 1 boundary check, error handling), comprehensive test coverage (7 test cases, including edge cases), and graceful degradation on non-Linux systems. The changes are surgical: only the PID comparison logic is modified, and unrelated code is untouched.

Last reviewed commit: 133421b
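The two call sites in the change list (the `ownsPort` check and the `staleGatewayPids` filter) reduce to one ownership rule: a PID is owned if it equals `MainPID` or descends from it. The sketch below illustrates that rule; `classifyGatewayPids`, `PortOwnership`, and the parameter names are hypothetical, and the descendant check is passed in so the rule can be shown independently of `/proc`.

```typescript
// Hypothetical predicate: true if `child` is a descendant of `ancestor`.
type DescendantCheck = (child: number, ancestor: number) => boolean;

interface PortOwnership {
  ownsPort: boolean; // some port listener belongs to the service
  stalePids: number[]; // gateway PIDs that do not belong to the service
}

// A PID is "owned" if it is MainPID itself or any descendant of it.
function classifyGatewayPids(
  mainPid: number,
  listenerPids: number[],
  gatewayPids: number[],
  isDescendantOf: DescendantCheck,
): PortOwnership {
  const owned = (pid: number): boolean =>
    pid === mainPid || isDescendantOf(pid, mainPid);
  return {
    ownsPort: listenerPids.some(owned),
    stalePids: gatewayPids.filter((pid) => !owned(pid)),
  };
}
```

With `MainPID` 100 and the child listener 200, a check that knows 200 descends from 100 yields `ownsPort: true` and leaves only an unrelated PID such as 900 stale. Passing `() => false` reproduces the pre-fix behavior: `ownsPort` is `false` and the healthy child 200 is reported as stale.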