feat: add external rescue watchdog by NickHood1984 · Pull Request #40791 · openclaw/openclaw

NickHood1984 · 2026-03-09T08:50:09Z

Summary

Problem: OpenClaw can report health and channel errors, but there was no external rescue loop that stays alive when the gateway itself is unhealthy.
Why it matters: in-process recovery dies with the same process, so failures that stop replies had to be diagnosed and handed off manually.
What changed: added an external rescue watchdog script, a Claude Code runner, systemd user service/timer templates, and setup docs.
What did NOT change (scope boundary): no core gateway runtime behavior, no automatic commit/push flow, and no bundled launchd plist.

Change Type (select all)

Feature
Docs
Chore/infra

Scope (select all touched areas)

Gateway / orchestration
UI / DX
CI/CD / infra

Linked Issue/PR

Related #

User-visible / Behavior Changes

Added scripts/rescue-watchdog.sh to detect openclaw health --json failures and channel lastError snapshots, capture incident bundles, de-duplicate repeats, optionally notify, and optionally invoke a rescue runner.
Added scripts/claude-rescue-runner.sh as a minimal Claude Code handoff runner for captured incidents.
Added scripts/systemd/openclaw-rescue-watchdog.{service,timer} templates.
Added docs/automation/rescue-watchdog.md and linked it from auth monitoring docs.

Security Impact (required)

New permissions/capabilities? (Yes)
Secrets/tokens handling changed? (No)
New/changed network calls? (Yes)
Command/tool execution surface changed? (Yes)
Data access scope changed? (Yes)
If any Yes, explain risk + mitigation:
The watchdog and runner are optional, external scripts. They execute only when explicitly installed/configured, default to no notification target and no runner, and keep the repair path outside the gateway process. The Claude runner is conservative by default (acceptEdits) and does not commit or push changes.

Repro + Verification

Environment

OS: macOS
Runtime/container: local source checkout
Model/provider: Claude Code CLI runner path validated with a stubbed CLAUDE_BIN=echo
Integration/channel (if any): none configured for notification during verification
Relevant config (redacted): synthetic incident via OPENCLAW_BIN=false

Steps

Run bash -n scripts/rescue-watchdog.sh and bash -n scripts/claude-rescue-runner.sh.
Trigger a synthetic incident with OPENCLAW_BIN=false OPENCLAW_RESCUE_STATE_DIR="$PWD/.artifacts/rescue-watchdog-test" ./scripts/rescue-watchdog.sh.
Re-run the same command to verify duplicate suppression.
Run OPENCLAW_RESCUE_INCIDENT_DIR=... OPENCLAW_RESCUE_WORKSPACE_DIR="$PWD" CLAUDE_BIN=echo ./scripts/claude-rescue-runner.sh to validate prompt/output generation.

Expected

The watchdog writes an incident bundle on first failure.
Repeated identical incidents within cooldown are skipped.
The Claude runner writes claude-prompt.txt and claude-output.txt.

Actual

Matched expected behavior.

Evidence

Failing test/log before + passing after
Trace/log snippets
Screenshot/recording
Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

Verified scenarios: shell syntax checks, synthetic incident capture, duplicate suppression, runner prompt/output generation.
Edge cases checked: health command failure path, repeated identical incident suppression, runner not configured path.
What you did not verify: a live end-to-end repair against a real broken gateway or a real notification channel.

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

Backward compatible? (Yes)
Config/env changes? (Yes)
Migration needed? (No)
If yes, exact upgrade steps:
Set the optional watchdog/runner environment variables and enable the provided systemd timer if you want the rescue flow.
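On a systemd host, the upgrade step above could look roughly like the following. This is only a sketch: the unit file names come from this PR, while the target directory (the conventional per-user unit path) and the print-only `run` helper are assumptions. The setup doc added in this PR remains the authority.

```shell
#!/usr/bin/env bash
# Sketch: print the commands that would install and enable the user timer.
# Assumption: units go in the conventional per-user systemd directory.
unit_dir="${XDG_CONFIG_HOME:-$HOME/.config}/systemd/user"

# Hypothetical helper: echo each command instead of executing it, so the
# sketch is safe to run on hosts without systemd.
run() { printf '%s\n' "$*"; }

install_watchdog_units() {
  run mkdir -p "$unit_dir"
  run cp scripts/systemd/openclaw-rescue-watchdog.service "$unit_dir/"
  run cp scripts/systemd/openclaw-rescue-watchdog.timer "$unit_dir/"
  run systemctl --user daemon-reload
  run systemctl --user enable --now openclaw-rescue-watchdog.timer
}

install_watchdog_units
```

Changing the body of `run` to `"$@"` would execute the commands instead of printing them. Reverting is the mirror image: `systemctl --user disable --now` on the timer, then removing the two copied unit files.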

Failure Recovery (if this breaks)

How to disable/revert this change quickly: do not install/enable the timer, or remove the copied user systemd unit files.
Files/config to restore: remove scripts/systemd/openclaw-rescue-watchdog.{service,timer} from the user systemd dir and unset OPENCLAW_RESCUE_* / CLAUDE_RESCUE_* env vars.
Known bad symptoms reviewers should watch for: unexpected repeated incident notifications due to local environment differences, or runner failures because claude is unavailable on PATH.

Risks and Mitigations

Risk: operators may assume the rescue path is active by default.
Mitigation: the docs explicitly describe the flow as optional and external.
Risk: the runner may be too permissive for some environments.
Mitigation: the sample runner defaults to acceptEdits, does not commit/push, and is fully configurable via env vars.

greptile-apps · 2026-03-09T08:52:49Z

Greptile Summary

This PR adds an optional external rescue watchdog (scripts/rescue-watchdog.sh) that polls openclaw health, captures incident bundles on failure, deduplicates repeat events, optionally sends notifications, and can hand the incident off to a Claude Code runner (scripts/claude-rescue-runner.sh). Systemd user timer templates and setup docs are also included. The change is additive and backward compatible — nothing runs unless explicitly installed.

Key findings:

scripts/rescue-watchdog.sh line 240: the runner is launched via bash -lc "$RUNNER_CMD" (login shell). On many systems profile scripts (.bash_profile, etc.) reset PATH or unset environment variables, including the OPENCLAW_RESCUE_* variables the watchdog exports. If those variables are cleared, claude-rescue-runner.sh exits with OPENCLAW_RESCUE_INCIDENT_DIR is required. Using bash -c instead of bash -lc would be safer unless login-shell PATH resolution is explicitly needed.
scripts/claude-rescue-runner.sh lines 34–67: the heredoc uses EOF as its delimiter while expanding $summary_text inside it. Although summary_text is currently always single-line, any future openclaw error message that produces a bare EOF line would silently truncate the Claude prompt. Using a less common delimiter (e.g. RESCUE_WATCHDOG_PROMPT_END) eliminates this edge case at no cost.

Confidence Score: 3/5

Safe to merge with minor fixes; the login-shell runner invocation could silently break the rescue path in certain environments.
The change is fully opt-in and additive — no existing behavior is altered. The two flagged issues are low-to-moderate: the bash -lc runner invocation is a real risk in environments with profile-script side effects, and the heredoc delimiter is a defensive concern rather than a confirmed bug. Neither blocks the happy path described in the PR, but the bash -lc issue could cause the runner to fail silently in CI or containerized setups where profile scripts reset the environment.
Pay close attention to scripts/rescue-watchdog.sh (runner invocation at line 240) and scripts/claude-rescue-runner.sh (heredoc delimiter at lines 34–67).

Last reviewed commit: 817a95c

greptile-apps · 2026-03-09T08:52:52Z

scripts/rescue-watchdog.sh

+  if bash -lc "$RUNNER_CMD" >"$incident_dir/runner.stdout" 2>"$incident_dir/runner.stderr"; then
+    runner_status="ok"
+  else
+    runner_status="failed:$?"
+  fi


Login shell may reset environment for runner

bash -lc "$RUNNER_CMD" spawns a login shell, which sources /etc/profile, ~/.bash_profile, and similar profile scripts. In some environments these profile scripts reset PATH, unset exported variables, or override other env vars that the watchdog carefully exports (e.g. OPENCLAW_RESCUE_INCIDENT_DIR, OPENCLAW_RESCUE_WORKSPACE_DIR). If those variables disappear, the runner will fail with OPENCLAW_RESCUE_INCIDENT_DIR is required.

For the typical operator use-case where RUNNER_CMD is simply a script path, using bash -c (non-login) or direct execution would be more predictable:

Suggested change

-  if bash -lc "$RUNNER_CMD" >"$incident_dir/runner.stdout" 2>"$incident_dir/runner.stderr"; then
-    runner_status="ok"
-  else
-    runner_status="failed:$?"
-  fi
+  if bash -c "$RUNNER_CMD" >"$incident_dir/runner.stdout" 2>"$incident_dir/runner.stderr"; then
+    runner_status="ok"
+  else
+    runner_status="failed:$?"
+  fi

If login-shell PATH resolution is intentionally needed (e.g. so the runner can find claude installed via a shell profile), consider documenting this expectation, or have the runner resolve its own PATH instead.
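To illustrate the non-login path, here is a small sketch showing that a variable exported by the parent is visible to a command started with `bash -c`, since no login profile scripts are sourced in between. The `run_runner` function and the sample directory are hypothetical; only the variable name and the error text come from the PR.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper mirroring the suggested invocation: a non-login shell
# inherits the caller's exported environment unchanged.
run_runner() {
  runner_cmd="$1"
  bash -c "$runner_cmd"
}

# Stand-in for the runner's guard: fail loudly if the variable is missing.
runner_guard='printf "%s\n" "${OPENCLAW_RESCUE_INCIDENT_DIR:?OPENCLAW_RESCUE_INCIDENT_DIR is required}"'

export OPENCLAW_RESCUE_INCIDENT_DIR="/tmp/example-incident"  # sample value
run_runner "$runner_guard"
```

If the runner genuinely needs a PATH that only a login profile provides, one alternative is to keep `bash -c` and have the runner locate its own binary, for example through a variable such as `CLAUDE_BIN`, which this PR already uses during verification.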

greptile-apps · 2026-03-09T08:52:53Z

scripts/claude-rescue-runner.sh

+cat >"$PROMPT_FILE" <<EOF
+An OpenClaw rescue watchdog detected a runtime failure.
+
+Goals:
+- Restore reply capability with the smallest safe fix.
+- Prefer narrow changes over refactors.
+- Do not commit, push, or modify files outside this workspace.
+- Use the captured incident artifacts before exploring broadly.
+
+Workspace:
+- Repo: $WORKSPACE_DIR
+- Incident dir: $INCIDENT_DIR
+
+Artifacts:
+- Summary: $SUMMARY_FILE
+- Health snapshot: $HEALTH_FILE
+- CLI status: $STATUS_FILE
+- Gateway log tail: $LOG_FILE
+
+Required steps:
+1. Read the incident artifacts first.
+2. Inspect only the code paths suggested by the incident.
+3. If you find a safe fix, apply it in this workspace.
+4. Run focused verification. Prefer pnpm build; if that is too expensive, explain why and run a narrower check.
+5. If possible, run openclaw health --json --timeout 10000 after the fix.
+6. Print a concise operator summary with:
+   - root cause
+   - files changed
+   - verification run
+   - remaining risk
+
+Incident summary:
+$summary_text
+EOF


Heredoc delimiter could be prematurely closed

The heredoc uses EOF as its delimiter, and $summary_text is expanded inside it. If an openclaw error message ever produces a line that is exactly the string EOF (e.g. a log line like ... error at EOF), bash will close the heredoc at that line, silently truncating the prompt written to $PROMPT_FILE.

Using a less likely delimiter prevents this:

Suggested change

-cat >"$PROMPT_FILE" <<EOF
-An OpenClaw rescue watchdog detected a runtime failure.
-
-Goals:
-- Restore reply capability with the smallest safe fix.
-- Prefer narrow changes over refactors.
-- Do not commit, push, or modify files outside this workspace.
-- Use the captured incident artifacts before exploring broadly.
-
-Workspace:
-- Repo: $WORKSPACE_DIR
-- Incident dir: $INCIDENT_DIR
-
-Artifacts:
-- Summary: $SUMMARY_FILE
-- Health snapshot: $HEALTH_FILE
-- CLI status: $STATUS_FILE
-- Gateway log tail: $LOG_FILE
-
-Required steps:
-1. Read the incident artifacts first.
-2. Inspect only the code paths suggested by the incident.
-3. If you find a safe fix, apply it in this workspace.
-4. Run focused verification. Prefer pnpm build; if that is too expensive, explain why and run a narrower check.
-5. If possible, run openclaw health --json --timeout 10000 after the fix.
-6. Print a concise operator summary with:
-   - root cause
-   - files changed
-   - verification run
-   - remaining risk
-
-Incident summary:
-$summary_text
-EOF
+cat >"$PROMPT_FILE" <<'CLAUDE_PROMPT_EOF'
+An OpenClaw rescue watchdog detected a runtime failure.
+
+Goals:
+- Restore reply capability with the smallest safe fix.
+- Prefer narrow changes over refactors.
+- Do not commit, push, or modify files outside this workspace.
+- Use the captured incident artifacts before exploring broadly.
+
+Workspace:
+- Repo: WORKSPACE_DIR_PLACEHOLDER
+- Incident dir: INCIDENT_DIR_PLACEHOLDER

Note: switching to a quoted (<<'...') heredoc disables variable expansion. A safer approach that keeps expansion is to use a unique, non-guessable delimiter:

cat >"$PROMPT_FILE" <<RESCUE_WATCHDOG_PROMPT_END
...
$summary_text
RESCUE_WATCHDOG_PROMPT_END

This makes accidental closure essentially impossible while keeping all $variable substitutions working.
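A minimal sketch of the unique-delimiter form, written as a hypothetical `render_prompt` function with a made-up sample summary. It also shows a property worth knowing when weighing this concern: bash matches the terminator against the lines of the script itself, before expansion, so text that arrives through `$summary_text` is copied into the output verbatim even when it contains a delimiter-like line. A distinctive delimiter is still cheap protection against a literal `EOF` line being added to the template later.

```shell
#!/usr/bin/env bash
# Hypothetical helper: render a prompt around an incident summary.
render_prompt() {
  summary_text="$1"
  cat <<RESCUE_WATCHDOG_PROMPT_END
Incident summary:
$summary_text
RESCUE_WATCHDOG_PROMPT_END
}

# Sample summary (made up) whose second line looks like the heredoc terminator.
sample_summary="$(printf 'health check failed\nRESCUE_WATCHDOG_PROMPT_END\ntrailing detail')"

# All three summary lines are printed; the expanded text does not end the heredoc.
render_prompt "$sample_summary"
```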

NickHood1984 · 2026-03-13T17:48:29Z

Addressed the Greptile feedback in d8b47c5.

switched the runner invocation from bash -lc to bash -c so the exported OPENCLAW_RESCUE_* environment survives runner launch more predictably
changed the Claude prompt heredoc delimiter from EOF to RESCUE_WATCHDOG_PROMPT_END to avoid accidental prompt truncation

Re-checked both scripts with bash -n.

NickHood1984 · 2026-03-13T18:08:25Z

Added a follow-up fix for the failing merge check in src/gateway/server-methods/chat.abort.test-helpers.ts.

This adds explicit exported helper types so the generated declaration surface no longer depends on inferred @vitest/spy internals (TS2883).

Local verification on this branch:

pnpm check

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8e1544c663

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-13T18:14:12Z

scripts/rescue-watchdog.sh

+if [ -z "$incident_reason" ]; then
+  echo "openclaw rescue watchdog: healthy"
+  exit 0


Clear suppression state after a healthy cycle

When incident_reason is empty, this branch exits immediately without updating state.env, so LAST_FINGERPRINT and LAST_TRIGGERED_AT from a prior outage remain active. In a fail → recover → fail sequence within OPENCLAW_RESCUE_COOLDOWN_SEC, the second outage is incorrectly treated as a duplicate and skips incident capture/runner execution/notification even though it is a new event after recovery.
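One way to picture the suggested behavior is the sketch below. The `LAST_FINGERPRINT` and `LAST_TRIGGERED_AT` keys and the `write_state` name appear in this PR; the helper bodies, `clear_state`, `is_duplicate`, the temporary state path, and the cooldown value are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: state layout and helper bodies are assumptions.
STATE_FILE="$(mktemp -d)/state.env"
COOLDOWN_SEC=900   # stand-in for OPENCLAW_RESCUE_COOLDOWN_SEC; value is made up

write_state() {    # args: fingerprint, epoch seconds
  printf 'LAST_FINGERPRINT=%s\nLAST_TRIGGERED_AT=%s\n' "$1" "$2" >"$STATE_FILE"
}

clear_state() {    # call on a healthy cycle so the next outage counts as new
  rm -f "$STATE_FILE"
}

is_duplicate() {   # args: fingerprint, now; true if same incident within cooldown
  [ -f "$STATE_FILE" ] || return 1
  LAST_FINGERPRINT="" LAST_TRIGGERED_AT=0
  . "$STATE_FILE"
  [ "$LAST_FINGERPRINT" = "$1" ] && [ $(( $2 - LAST_TRIGGERED_AT )) -lt "$COOLDOWN_SEC" ]
}

# fail -> recover -> fail, all inside one cooldown window
write_state "abc123" 1000
is_duplicate "abc123" 1060 && echo "repeat during outage: suppressed"
clear_state
is_duplicate "abc123" 1120 || echo "failure after recovery: treated as new"
```

In the watchdog itself, the equivalent of `clear_state` would run in the healthy branch just before `exit 0`.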

NickHood1984 · 2026-03-13T18:31:42Z

Updated the branch to current main and fixed the resulting test-only type regressions that were failing the merge check.

Local verification on this branch:

pnpm check

The watchdog changes are unchanged; this follow-up only touches test typing/inference on top of the current base.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 22b231e30c

chatgpt-codex-connector · 2026-03-13T18:38:50Z

scripts/rescue-watchdog.sh

+incident_id="$(date -u +"%Y%m%dT%H%M%SZ")"
+incident_dir="$INCIDENT_ROOT/$incident_id"
+mkdir -p "$incident_dir"


Generate unique incident IDs per run

When two watchdog invocations detect an incident within the same second (for example, an overlapping cron run plus a manual run), both derive the same incident_id and write into the same directory because the ID is second-granularity and mkdir -p reuses existing paths. Under set -e, this can cause one run to fail during file writes/copies and leave a mixed or partial incident bundle instead of two independent captures.
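A possible shape for per-run uniqueness, assuming `mktemp` is available: keep the timestamp as a readable prefix and let `mktemp -d` append a random suffix. The `new_incident_dir` helper is hypothetical, and `INCIDENT_ROOT` is pointed at a throwaway directory for the demo.

```shell
#!/usr/bin/env bash
# Sketch: timestamp prefix for readability, random suffix for uniqueness.
INCIDENT_ROOT="$(mktemp -d)"   # throwaway root for this demo only

new_incident_dir() {           # hypothetical helper; prints the created path
  stamp="$(date -u +"%Y%m%dT%H%M%SZ")"
  # mktemp -d creates the directory itself and fails rather than reuse a path.
  mktemp -d "$INCIDENT_ROOT/$stamp-XXXXXX"
}

first="$(new_incident_dir)"
second="$(new_incident_dir)"
echo "incident ids: $(basename "$first") $(basename "$second")"
```

Two calls within the same second still yield distinct directories, so overlapping runs would each write their own bundle.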

NickHood1984 · 2026-03-14T03:29:19Z

Synced this branch with the latest upstream main in commit fc6cfc7 and resolved the merge conflicts in the previously added test-only type fixes.

Local verification on 2026-03-14:

pnpm install --frozen-lockfile
pnpm tsgo
pnpm exec vitest run --config vitest.unit.config.ts src/plugins/install.test.ts src/hooks/install.test.ts src/gateway/server-runtime-config.test.ts src/gateway/server.talk-config.test.ts src/telegram/network-config.test.ts src/infra/outbound/targets.test.ts src/infra/provider-usage.auth.normalizes-keys.test.ts
pnpm check

This also aligns the npm install assertion helper with the current install command shape after the recent upstream changes.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fc6cfc7c24

chatgpt-codex-connector · 2026-03-14T03:36:02Z

scripts/rescue-watchdog.sh

+fi
+printf '%s\n' "$runner_status" >"$incident_dir/runner.status"
+
+write_state "$fingerprint" "$now_epoch" "$incident_id"


Persist suppression state before invoking runner

The watchdog writes state.env only after RUNNER_CMD returns, which leaves the latest fingerprint/cooldown invisible during long rescue runs. If another invocation starts in that window (for example, cron overlap or a manual rerun), it will treat the same outage as new and run a second incident capture/runner cycle, producing duplicate remediation attempts and notifications against the same failure. Save suppression state before launching the runner (or guard with a lock) so overlapping executions are deduplicated immediately.
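The lock variant mentioned above could be sketched with `mkdir`, which is atomic and needs no extra tooling. The lock path and the helper names are hypothetical; a real script would keep the lock under its state directory and release it from an `EXIT` trap.

```shell
#!/usr/bin/env bash
# Sketch: mkdir either creates the directory or fails, so only one
# invocation at a time can hold the lock.
LOCK_DIR="$(mktemp -d)/watchdog.lock"   # hypothetical path for the demo

acquire_lock() { mkdir "$LOCK_DIR" 2>/dev/null; }
release_lock() { rmdir "$LOCK_DIR" 2>/dev/null || true; }

if acquire_lock; then
  echo "first invocation: lock held, safe to record state and start the runner"
fi
if ! acquire_lock; then
  echo "overlapping invocation: lock busy, skipping"
fi
release_lock
```

A lock left behind by a killed run would block later runs, so a production version would also need some stale-lock handling. Writing `state.env` before launching the runner, as the comment suggests, avoids that concern.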

NickHood1984 · 2026-03-14T08:11:06Z

Split the inherited CI fixes out into separate PRs so this feature PR can stay scoped:

extensions/diffs: fix: load pierre themes without json module imports #45869
Windows path/runtime tests: fix: align windows path tests with runtime behavior #45870

Those failures were inherited from main, not introduced by the rescue watchdog changes.

feat: add external rescue watchdog

817a95c

openclaw-barnacle bot added docs Improvements or additions to documentation scripts Repository scripts size: M labels Mar 9, 2026

greptile-apps bot reviewed Mar 9, 2026

fix: address rescue watchdog review comments

d8b47c5

fix: annotate chat abort test helpers

8e1544c

openclaw-barnacle bot added the gateway Gateway runtime label Mar 13, 2026

chatgpt-codex-connector bot reviewed Mar 13, 2026

NIckhood-123 added 2 commits March 14, 2026 02:26

Merge remote-tracking branch 'upstream/main' into codex/rescue-watchdog

5429c7d

test: fix type coverage after main merge

22b231e

openclaw-barnacle bot added the channel: telegram Channel integration: telegram label Mar 13, 2026

chatgpt-codex-connector bot reviewed Mar 13, 2026

Merge remote-tracking branch 'upstream/main' into codex/rescue-watchdog

fc6cfc7

openclaw-barnacle bot removed the gateway Gateway runtime label Mar 14, 2026

chatgpt-codex-connector bot reviewed Mar 14, 2026

This was referenced Mar 14, 2026

fix: load pierre themes without json module imports #45869

Merged

fix: align windows path tests with runtime behavior #45870

Open

2 participants