fix(gateway): resolve PID mismatch in restart health check #24301

Closed
justinhuangcode wants to merge 3 commits into openclaw:main from justinhuangcode:fix/gateway-restart-pid-mismatch
@justinhuangcode (Contributor) commented Feb 23, 2026

Summary

Fixes #24279

openclaw gateway restart always times out because the health check compares the port-listener PID (child process) directly against the systemd MainPID (supervisor process). Since the gateway runs as a two-process tree (supervisor → child that binds the port), these are structurally always different, causing:

  1. ownsPort is always false → healthy is always false
  2. The child PID is always classified as stale → gets killed
  3. The CLI waits 60s for health to become true, which never happens
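The failure mode above can be illustrated with a minimal sketch. This is not the project's actual code: `RestartSnapshot`, `legacyHealthCheck`, and the example PIDs are hypothetical, and `healthy` is assumed here to depend only on `ownsPort`.

```typescript
// Hypothetical model of the pre-fix check; names and shapes are assumptions.
interface RestartSnapshot {
  mainPid: number;            // systemd MainPID (the supervisor)
  portListenerPids: number[]; // PIDs currently bound to the gateway port
}

interface HealthResult {
  ownsPort: boolean;
  healthy: boolean;
  stalePids: number[];
}

function legacyHealthCheck(snapshot: RestartSnapshot): HealthResult {
  // Direct PID equality: only the supervisor itself counts as the owner.
  const ownsPort = snapshot.portListenerPids.some(
    (pid) => pid === snapshot.mainPid,
  );
  // Every listener that is not MainPID is treated as stale.
  const stalePids = snapshot.portListenerPids.filter(
    (pid) => pid !== snapshot.mainPid,
  );
  // Assumption for this sketch: health is derived from ownsPort alone.
  return { ownsPort, healthy: ownsPort, stalePids };
}

// Supervisor is PID 1000, the child that binds the port is PID 1001.
const legacyResult = legacyHealthCheck({
  mainPid: 1000,
  portListenerPids: [1001],
});
console.log(legacyResult);
```

With a supervisor/child split, the listener PID never equals MainPID, so the sketch reports the healthy child as stale and never reports `healthy`.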

Changes

  • Add isDescendantOf(childPid, ancestorPid), which follows the PPid field of /proc/<pid>/status up the process tree to check whether childPid descends from ancestorPid
  • Update ownsPort check: accept port listeners that are descendants of MainPID
  • Update staleGatewayPids filter: descendants of MainPID are not stale
  • Add unit tests for isDescendantOf
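One way the process-tree walk might look is sketched below. It is an illustration under stated assumptions, not the code from this PR: the third `getParent` parameter, `parsePPid`, and `readParentPid` are hypothetical additions for testability, and the safeguards (visited set, PID <= 1 boundary, error handling) follow the ones named in the review summary.

```typescript
import { readFileSync } from "node:fs";

// Hypothetical injection point so the walk can be exercised without /proc.
type ParentLookup = (pid: number) => number | null;

// Extract the parent PID from the text of /proc/<pid>/status ("PPid:\t1234").
function parsePPid(statusText: string): number | null {
  const match = /^PPid:\s*(\d+)/m.exec(statusText);
  return match ? Number(match[1]) : null;
}

// Returns null when /proc is unavailable (e.g. macOS, Windows) or the
// process has already exited.
function readParentPid(pid: number): number | null {
  try {
    return parsePPid(readFileSync(`/proc/${pid}/status`, "utf8"));
  } catch {
    return null;
  }
}

function isDescendantOf(
  childPid: number,
  ancestorPid: number,
  getParent: ParentLookup = readParentPid,
): boolean {
  // Assumption: PIDs <= 1 are out of scope, and a process is not its own descendant.
  if (childPid <= 1 || ancestorPid <= 1 || childPid === ancestorPid) {
    return false;
  }
  const visited = new Set<number>(); // guards against cycles in bad data
  let current = childPid;
  while (current > 1 && !visited.has(current)) {
    visited.add(current);
    const parent = getParent(current);
    if (parent === null) return false; // lookup failed: treat as unrelated
    if (parent === ancestorPid) return true;
    current = parent;
  }
  return false;
}
```

Injecting the parent lookup keeps the walk itself pure, so direct children, grandchildren, cycles, and lookup failures can all be tested with an in-memory map instead of real processes.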

Fallback behavior

On non-Linux systems (macOS, Windows) where /proc is unavailable, isDescendantOf returns false gracefully, preserving the existing direct-PID comparison behavior.
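The fallback can be seen in a small sketch of how the updated ownership and stale checks might combine direct comparison with the descendant test. The names `listenerOwnedByService`, `findStaleListeners`, `DescendantCheck`, and `noProcfs` are hypothetical and chosen for illustration only.

```typescript
// Hypothetical predicate type standing in for isDescendantOf(childPid, ancestorPid).
type DescendantCheck = (childPid: number, ancestorPid: number) => boolean;

// A listener is owned if it is MainPID itself or a descendant of MainPID.
function listenerOwnedByService(
  listenerPid: number,
  mainPid: number,
  isDescendant: DescendantCheck,
): boolean {
  return listenerPid === mainPid || isDescendant(listenerPid, mainPid);
}

// Stale listeners are those that are neither MainPID nor one of its descendants.
function findStaleListeners(
  listenerPids: number[],
  mainPid: number,
  isDescendant: DescendantCheck,
): number[] {
  return listenerPids.filter(
    (pid) => !listenerOwnedByService(pid, mainPid, isDescendant),
  );
}

// Models a platform without /proc: the descendant check always answers false,
// so only the direct PID comparison remains in effect.
const noProcfs: DescendantCheck = () => false;

console.log(listenerOwnedByService(1000, 1000, noProcfs));
console.log(listenerOwnedByService(1001, 1000, noProcfs));
```

Because the descendant test is only ever OR-ed into the ownership check, a check that always returns false reduces both functions to the previous direct-PID behavior.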

Testing

  • src/cli/daemon-cli/restart-health.test.ts — unit tests for isDescendantOf (7 cases)
  • Existing lifecycle.test.ts tests are unaffected (they mock the restart-health module)

AI-assisted: Yes (Claude). Prompts/session logs available on request.

Greptile Summary

Fixed openclaw gateway restart timeout by adding process tree awareness to health checks. The gateway runs as supervisor → child process, where the child binds the port. Previously, health checks compared port listener PID directly against supervisor PID, causing perpetual "unhealthy" status.

Changes:

  • Added isDescendantOf() to walk /proc/<pid>/status and check parent-child relationships
  • Updated ownsPort check to accept descendants of MainPID as valid owners
  • Updated staleGatewayPids filter to exclude descendants of MainPID
  • Added 7 unit tests covering direct children, grandchildren, error cases, and edge cases
  • Graceful fallback on non-Linux systems (returns false, preserves existing behavior)

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The fix directly addresses the reported issue with a clean implementation. The isDescendantOf function has proper safeguards (visited set to prevent loops, PID <= 1 boundary check, error handling), comprehensive test coverage (7 test cases including edge cases), and graceful degradation on non-Linux systems. The changes are surgical - only modifying the PID comparison logic without touching unrelated code.
  • No files require special attention

Last reviewed commit: 133421b

The gateway runs as a two-process tree (supervisor → child), but the
health check compared the port-listener PID (child) directly against
the systemd MainPID (supervisor).  Because they are structurally
different, `ownsPort` was always false and the child was always
classified as stale, causing `openclaw gateway restart` to kill the
healthy child and then time out.

Walk the process tree via /proc/<pid>/status PPid field so that any
descendant of MainPID is recognised as owned.

Fixes openclaw#24279

AI-assisted: Yes (Claude). Prompts/session logs available on request.
@justinhuangcode force-pushed the fix/gateway-restart-pid-mismatch branch from f8e4637 to f95efcd on February 23, 2026 15:18
@justinhuangcode (Contributor, Author) commented

Closing: the underlying issue has been resolved by #24696 (merged), which addresses the same PID ownership problem via a different approach. Thank you!

Labels

cli (CLI command changes), size: S

Development

Successfully merging this pull request may close these issues.

[Bug]: openclaw gateway restart always times out due to MainPID vs child PID mismatch in health check

1 participant