Skip to content

feat: add external rescue watchdog#40791

Open
NickHood1984 wants to merge 6 commits intoopenclaw:mainfrom
NickHood1984:codex/rescue-watchdog
Open

feat: add external rescue watchdog#40791
NickHood1984 wants to merge 6 commits intoopenclaw:mainfrom
NickHood1984:codex/rescue-watchdog

Conversation

@NickHood1984
Copy link
Copy Markdown
Contributor

Summary

  • Problem: OpenClaw can report health and channel errors, but there was no external rescue loop that stays alive when the gateway itself is unhealthy.
  • Why it matters: in-process recovery dies with the same process, so failures that stop replies had to be diagnosed and handed off manually.
  • What changed: added an external rescue watchdog script, a Claude Code runner, systemd user service/timer templates, and setup docs.
  • What did NOT change (scope boundary): no core gateway runtime behavior, no automatic commit/push flow, and no bundled launchd plist.

Change Type (select all)

  • Feature
  • Docs
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Related #

User-visible / Behavior Changes

  • Added scripts/rescue-watchdog.sh to detect openclaw health --json failures and channel lastError snapshots, capture incident bundles, de-duplicate repeats, optionally notify, and optionally invoke a rescue runner.
  • Added scripts/claude-rescue-runner.sh as a minimal Claude Code handoff runner for captured incidents.
  • Added scripts/systemd/openclaw-rescue-watchdog.{service,timer} templates.
  • Added docs/automation/rescue-watchdog.md and linked it from auth monitoring docs.

Security Impact (required)

  • New permissions/capabilities? (Yes)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (Yes)
  • Command/tool execution surface changed? (Yes)
  • Data access scope changed? (Yes)
  • If any Yes, explain risk + mitigation:
    The watchdog and runner are optional, external scripts. They execute only when explicitly installed/configured, default to no notification target and no runner, and keep the repair path outside the gateway process. The Claude runner is conservative by default (acceptEdits) and does not commit or push changes.

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: local source checkout
  • Model/provider: Claude Code CLI runner path validated with a stubbed CLAUDE_BIN=echo
  • Integration/channel (if any): none configured for notification during verification
  • Relevant config (redacted): synthetic incident via OPENCLAW_BIN=false

Steps

  1. Run bash -n scripts/rescue-watchdog.sh and bash -n scripts/claude-rescue-runner.sh.
  2. Trigger a synthetic incident with OPENCLAW_BIN=false OPENCLAW_RESCUE_STATE_DIR="$PWD/.artifacts/rescue-watchdog-test" ./scripts/rescue-watchdog.sh.
  3. Re-run the same command to verify duplicate suppression.
  4. Run OPENCLAW_RESCUE_INCIDENT_DIR=... OPENCLAW_RESCUE_WORKSPACE_DIR="$PWD" CLAUDE_BIN=echo ./scripts/claude-rescue-runner.sh to validate prompt/output generation.

Expected

  • The watchdog writes an incident bundle on first failure.
  • Repeated identical incidents within cooldown are skipped.
  • The Claude runner writes claude-prompt.txt and claude-output.txt.

Actual

  • Matched expected behavior.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: shell syntax checks, synthetic incident capture, duplicate suppression, runner prompt/output generation.
  • Edge cases checked: health command failure path, repeated identical incident suppression, runner not configured path.
  • What you did not verify: a live end-to-end repair against a real broken gateway or a real notification channel.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (Yes)
  • Migration needed? (No)
  • If yes, exact upgrade steps:
    Set the optional watchdog/runner environment variables and enable the provided systemd timer if you want the rescue flow.

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: do not install/enable the timer, or remove the copied user systemd unit files.
  • Files/config to restore: remove scripts/systemd/openclaw-rescue-watchdog.{service,timer} from the user systemd dir and unset OPENCLAW_RESCUE_* / CLAUDE_RESCUE_* env vars.
  • Known bad symptoms reviewers should watch for: unexpected repeated incident notifications due to local environment differences, or runner failures because claude is unavailable on PATH.

Risks and Mitigations

  • Risk: operators may assume the rescue path is active by default.
    Mitigation: the docs explicitly describe the flow as optional and external.
  • Risk: the runner may be too permissive for some environments.
    Mitigation: the sample runner defaults to acceptEdits, does not commit/push, and is fully configurable via env vars.

@openclaw-barnacle openclaw-barnacle bot added docs Improvements or additions to documentation scripts Repository scripts size: M labels Mar 9, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 9, 2026

Greptile Summary

This PR adds an optional external rescue watchdog (scripts/rescue-watchdog.sh) that polls openclaw health, captures incident bundles on failure, deduplicates repeat events, optionally sends notifications, and can hand the incident off to a Claude Code runner (scripts/claude-rescue-runner.sh). Systemd user timer templates and setup docs are also included. The change is additive and backward compatible — nothing runs unless explicitly installed.

Key findings:

  • scripts/rescue-watchdog.sh line 240: the runner is launched via bash -lc "$RUNNER_CMD" (login shell). On many systems profile scripts (.bash_profile, etc.) reset PATH or unset environment variables, including the OPENCLAW_RESCUE_* variables the watchdog exports. If those variables are cleared, claude-rescue-runner.sh exits with OPENCLAW_RESCUE_INCIDENT_DIR is required. Using bash -c instead of bash -lc would be safer unless login-shell PATH resolution is explicitly needed.
  • scripts/claude-rescue-runner.sh lines 34–67: the heredoc uses EOF as its delimiter while expanding $summary_text inside it. Although summary_text is currently always single-line, any future openclaw error message that produces a bare EOF line would silently truncate the Claude prompt. Using a less common delimiter (e.g. RESCUE_WATCHDOG_PROMPT_END) eliminates this edge case at no cost.

Confidence Score: 3/5

  • Safe to merge with minor fixes; the login-shell runner invocation could silently break the rescue path in certain environments.
  • The change is fully opt-in and additive — no existing behavior is altered. The two flagged issues are low-to-moderate: the bash -lc runner invocation is a real risk in environments with profile-script side effects, and the heredoc delimiter is a defensive concern rather than a confirmed bug. Neither blocks the happy path described in the PR, but the bash -lc issue could cause the runner to fail silently in CI or containerized setups where profile scripts reset the environment.
  • Pay close attention to scripts/rescue-watchdog.sh (runner invocation at line 240) and scripts/claude-rescue-runner.sh (heredoc delimiter at lines 34–67).

Last reviewed commit: 817a95c

Comment on lines +240 to +244
if bash -lc "$RUNNER_CMD" >"$incident_dir/runner.stdout" 2>"$incident_dir/runner.stderr"; then
runner_status="ok"
else
runner_status="failed:$?"
fi
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Login shell may reset environment for runner

bash -lc "$RUNNER_CMD" spawns a login shell, which sources /etc/profile, ~/.bash_profile, and similar profile scripts. In some environments these profile scripts reset PATH, unset exported variables, or override other env vars that the watchdog carefully exports (e.g. OPENCLAW_RESCUE_INCIDENT_DIR, OPENCLAW_RESCUE_WORKSPACE_DIR). If those variables disappear, the runner will fail with OPENCLAW_RESCUE_INCIDENT_DIR is required.

For the typical operator use-case where RUNNER_CMD is simply a script path, using bash -c (non-login) or direct execution would be more predictable:

Suggested change
if bash -lc "$RUNNER_CMD" >"$incident_dir/runner.stdout" 2>"$incident_dir/runner.stderr"; then
runner_status="ok"
else
runner_status="failed:$?"
fi
if bash -c "$RUNNER_CMD" >"$incident_dir/runner.stdout" 2>"$incident_dir/runner.stderr"; then
runner_status="ok"
else
runner_status="failed:$?"
fi

If login-shell PATH resolution is intentionally needed (e.g. so the runner can find claude installed via a shell profile), consider documenting this expectation, or have the runner resolve its own PATH instead.

Prompt To Fix With AI
This is a comment left during a code review.
Path: scripts/rescue-watchdog.sh
Line: 240-244

Comment:
**Login shell may reset environment for runner**

`bash -lc "$RUNNER_CMD"` spawns a login shell, which sources `/etc/profile`, `~/.bash_profile`, and similar profile scripts. In some environments these profile scripts reset `PATH`, unset exported variables, or override other env vars that the watchdog carefully exports (e.g. `OPENCLAW_RESCUE_INCIDENT_DIR`, `OPENCLAW_RESCUE_WORKSPACE_DIR`). If those variables disappear, the runner will fail with `OPENCLAW_RESCUE_INCIDENT_DIR is required`.

For the typical operator use-case where `RUNNER_CMD` is simply a script path, using `bash -c` (non-login) or direct execution would be more predictable:

```suggestion
  if bash -c "$RUNNER_CMD" >"$incident_dir/runner.stdout" 2>"$incident_dir/runner.stderr"; then
    runner_status="ok"
  else
    runner_status="failed:$?"
  fi
```

If login-shell PATH resolution is intentionally needed (e.g. so the runner can find `claude` installed via a shell profile), consider documenting this expectation, or have the runner resolve its own PATH instead.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +34 to +67
cat >"$PROMPT_FILE" <<EOF
An OpenClaw rescue watchdog detected a runtime failure.

Goals:
- Restore reply capability with the smallest safe fix.
- Prefer narrow changes over refactors.
- Do not commit, push, or modify files outside this workspace.
- Use the captured incident artifacts before exploring broadly.

Workspace:
- Repo: $WORKSPACE_DIR
- Incident dir: $INCIDENT_DIR

Artifacts:
- Summary: $SUMMARY_FILE
- Health snapshot: $HEALTH_FILE
- CLI status: $STATUS_FILE
- Gateway log tail: $LOG_FILE

Required steps:
1. Read the incident artifacts first.
2. Inspect only the code paths suggested by the incident.
3. If you find a safe fix, apply it in this workspace.
4. Run focused verification. Prefer pnpm build; if that is too expensive, explain why and run a narrower check.
5. If possible, run openclaw health --json --timeout 10000 after the fix.
6. Print a concise operator summary with:
- root cause
- files changed
- verification run
- remaining risk

Incident summary:
$summary_text
EOF
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heredoc delimiter could be prematurely closed

The heredoc uses EOF as its delimiter, and $summary_text is expanded inside it. If an openclaw error message ever produces a line that is exactly the string EOF (e.g. a log line like ... error at EOF), bash will close the heredoc at that line, silently truncating the prompt written to $PROMPT_FILE.

Using a less likely delimiter prevents this:

Suggested change
cat >"$PROMPT_FILE" <<EOF
An OpenClaw rescue watchdog detected a runtime failure.
Goals:
- Restore reply capability with the smallest safe fix.
- Prefer narrow changes over refactors.
- Do not commit, push, or modify files outside this workspace.
- Use the captured incident artifacts before exploring broadly.
Workspace:
- Repo: $WORKSPACE_DIR
- Incident dir: $INCIDENT_DIR
Artifacts:
- Summary: $SUMMARY_FILE
- Health snapshot: $HEALTH_FILE
- CLI status: $STATUS_FILE
- Gateway log tail: $LOG_FILE
Required steps:
1. Read the incident artifacts first.
2. Inspect only the code paths suggested by the incident.
3. If you find a safe fix, apply it in this workspace.
4. Run focused verification. Prefer pnpm build; if that is too expensive, explain why and run a narrower check.
5. If possible, run openclaw health --json --timeout 10000 after the fix.
6. Print a concise operator summary with:
- root cause
- files changed
- verification run
- remaining risk
Incident summary:
$summary_text
EOF
cat >"$PROMPT_FILE" <<'CLAUDE_PROMPT_EOF'
An OpenClaw rescue watchdog detected a runtime failure.
Goals:
- Restore reply capability with the smallest safe fix.
- Prefer narrow changes over refactors.
- Do not commit, push, or modify files outside this workspace.
- Use the captured incident artifacts before exploring broadly.
Workspace:
- Repo: WORKSPACE_DIR_PLACEHOLDER
- Incident dir: INCIDENT_DIR_PLACEHOLDER

Note: switching to a quoted (<<'...') heredoc disables variable expansion. A safer approach that keeps expansion is to use a unique, non-guessable delimiter:

cat >"$PROMPT_FILE" <<RESCUE_WATCHDOG_PROMPT_END
...
$summary_text
RESCUE_WATCHDOG_PROMPT_END

This makes accidental closure essentially impossible while keeping all $variable substitutions working.

Prompt To Fix With AI
This is a comment left during a code review.
Path: scripts/claude-rescue-runner.sh
Line: 34-67

Comment:
**Heredoc delimiter could be prematurely closed**

The heredoc uses `EOF` as its delimiter, and `$summary_text` is expanded inside it. If an `openclaw` error message ever produces a line that is exactly the string `EOF` (e.g. a log line like `... error at EOF`), bash will close the heredoc at that line, silently truncating the prompt written to `$PROMPT_FILE`.

Using a less likely delimiter prevents this:

```suggestion
cat >"$PROMPT_FILE" <<'CLAUDE_PROMPT_EOF'
An OpenClaw rescue watchdog detected a runtime failure.

Goals:
- Restore reply capability with the smallest safe fix.
- Prefer narrow changes over refactors.
- Do not commit, push, or modify files outside this workspace.
- Use the captured incident artifacts before exploring broadly.

Workspace:
- Repo: WORKSPACE_DIR_PLACEHOLDER
- Incident dir: INCIDENT_DIR_PLACEHOLDER
```

Note: switching to a quoted (`<<'...'`) heredoc disables variable expansion. A safer approach that keeps expansion is to use a unique, non-guessable delimiter:

```bash
cat >"$PROMPT_FILE" <<RESCUE_WATCHDOG_PROMPT_END
...
$summary_text
RESCUE_WATCHDOG_PROMPT_END
```

This makes accidental closure essentially impossible while keeping all `$variable` substitutions working.

How can I resolve this? If you propose a fix, please make it concise.

@NickHood1984
Copy link
Copy Markdown
Contributor Author

Addressed the Greptile feedback in d8b47c5.

  • switched the runner invocation from bash -lc to bash -c so the exported OPENCLAW_RESCUE_* environment survives runner launch more predictably
  • changed the Claude prompt heredoc delimiter from EOF to RESCUE_WATCHDOG_PROMPT_END to avoid accidental prompt truncation

Re-checked both scripts with bash -n.

@openclaw-barnacle openclaw-barnacle bot added the gateway Gateway runtime label Mar 13, 2026
@NickHood1984
Copy link
Copy Markdown
Contributor Author

Added a follow-up fix for the failing merge check in src/gateway/server-methods/chat.abort.test-helpers.ts.

This adds explicit exported helper types so the generated declaration surface no longer depends on inferred @vitest/spy internals (TS2883).

Local verification on this branch:

  • pnpm check

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8e1544c663

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +175 to +177
if [ -z "$incident_reason" ]; then
echo "openclaw rescue watchdog: healthy"
exit 0
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Clear suppression state after a healthy cycle

When incident_reason is empty, this branch exits immediately without updating state.env, so LAST_FINGERPRINT and LAST_TRIGGERED_AT from a prior outage remain active. In a fail → recover → fail sequence within OPENCLAW_RESCUE_COOLDOWN_SEC, the second outage is incorrectly treated as a duplicate and skips incident capture/runner execution/notification even though it is a new event after recovery.

Useful? React with 👍 / 👎.

@NickHood1984
Copy link
Copy Markdown
Contributor Author

Updated the branch to current main and fixed the resulting test-only type regressions that were failing the merge check.

Local verification on this branch:

  • pnpm check

The watchdog changes are unchanged; this follow-up only touches test typing/inference on top of the current base.

@openclaw-barnacle openclaw-barnacle bot added the channel: telegram Channel integration: telegram label Mar 13, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 22b231e30c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +196 to +198
incident_id="$(date -u +"%Y%m%dT%H%M%SZ")"
incident_dir="$INCIDENT_ROOT/$incident_id"
mkdir -p "$incident_dir"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Generate unique incident IDs per run

When two watchdog invocations detect an incident within the same second (for example, an overlapping cron run plus a manual run), both derive the same incident_id and write into the same directory because the ID is second-granularity and mkdir -p reuses existing paths. Under set -e, this can cause one run to fail during file writes/copies and leave a mixed or partial incident bundle instead of two independent captures.

Useful? React with 👍 / 👎.

@openclaw-barnacle openclaw-barnacle bot removed the gateway Gateway runtime label Mar 14, 2026
@NickHood1984
Copy link
Copy Markdown
Contributor Author

Synced this branch with the latest upstream main in commit fc6cfc7 and resolved the merge conflicts in the previously added test-only type fixes.\n\nLocal verification on 2026-03-14:\n- pnpm install --frozen-lockfile\n- pnpm tsgo\n- pnpm exec vitest run --config vitest.unit.config.ts src/plugins/install.test.ts src/hooks/install.test.ts src/gateway/server-runtime-config.test.ts src/gateway/server.talk-config.test.ts src/telegram/network-config.test.ts src/infra/outbound/targets.test.ts src/infra/provider-usage.auth.normalizes-keys.test.ts\n- pnpm check\n\nThis also aligns the npm install assertion helper with the current install command shape after the recent upstream changes.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fc6cfc7c24

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

fi
printf '%s\n' "$runner_status" >"$incident_dir/runner.status"

write_state "$fingerprint" "$now_epoch" "$incident_id"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Persist suppression state before invoking runner

The watchdog writes state.env only after RUNNER_CMD returns, which leaves the latest fingerprint/cooldown invisible during long rescue runs. If another invocation starts in that window (for example, cron overlap or a manual rerun), it will treat the same outage as new and run a second incident capture/runner cycle, producing duplicate remediation attempts and notifications against the same failure. Save suppression state before launching the runner (or guard with a lock) so overlapping executions are deduplicated immediately.

Useful? React with 👍 / 👎.

@NickHood1984
Copy link
Copy Markdown
Contributor Author

Split the inherited CI fixes out into separate PRs so this feature PR can stay scoped:

Those failures were inherited from main, not introduced by the rescue watchdog changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

channel: telegram Channel integration: telegram docs Improvements or additions to documentation scripts Repository scripts size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants