feat: add external rescue watchdog#40791
Conversation
Greptile SummaryThis PR adds an optional external rescue watchdog ( Key findings:
Confidence Score: 3/5
Last reviewed commit: 817a95c |
scripts/rescue-watchdog.sh
Outdated
| if bash -lc "$RUNNER_CMD" >"$incident_dir/runner.stdout" 2>"$incident_dir/runner.stderr"; then | ||
| runner_status="ok" | ||
| else | ||
| runner_status="failed:$?" | ||
| fi |
There was a problem hiding this comment.
Login shell may reset environment for runner
bash -lc "$RUNNER_CMD" spawns a login shell, which sources /etc/profile, ~/.bash_profile, and similar profile scripts. In some environments these profile scripts reset PATH, unset exported variables, or override other env vars that the watchdog carefully exports (e.g. OPENCLAW_RESCUE_INCIDENT_DIR, OPENCLAW_RESCUE_WORKSPACE_DIR). If those variables disappear, the runner will fail with OPENCLAW_RESCUE_INCIDENT_DIR is required.
For the typical operator use-case where RUNNER_CMD is simply a script path, using bash -c (non-login) or direct execution would be more predictable:
| if bash -lc "$RUNNER_CMD" >"$incident_dir/runner.stdout" 2>"$incident_dir/runner.stderr"; then | |
| runner_status="ok" | |
| else | |
| runner_status="failed:$?" | |
| fi | |
| if bash -c "$RUNNER_CMD" >"$incident_dir/runner.stdout" 2>"$incident_dir/runner.stderr"; then | |
| runner_status="ok" | |
| else | |
| runner_status="failed:$?" | |
| fi |
If login-shell PATH resolution is intentionally needed (e.g. so the runner can find claude installed via a shell profile), consider documenting this expectation, or have the runner resolve its own PATH instead.
Prompt To Fix With AI
This is a comment left during a code review.
Path: scripts/rescue-watchdog.sh
Line: 240-244
Comment:
**Login shell may reset environment for runner**
`bash -lc "$RUNNER_CMD"` spawns a login shell, which sources `/etc/profile`, `~/.bash_profile`, and similar profile scripts. In some environments these profile scripts reset `PATH`, unset exported variables, or override other env vars that the watchdog carefully exports (e.g. `OPENCLAW_RESCUE_INCIDENT_DIR`, `OPENCLAW_RESCUE_WORKSPACE_DIR`). If those variables disappear, the runner will fail with `OPENCLAW_RESCUE_INCIDENT_DIR is required`.
For the typical operator use-case where `RUNNER_CMD` is simply a script path, using `bash -c` (non-login) or direct execution would be more predictable:
```suggestion
if bash -c "$RUNNER_CMD" >"$incident_dir/runner.stdout" 2>"$incident_dir/runner.stderr"; then
runner_status="ok"
else
runner_status="failed:$?"
fi
```
If login-shell PATH resolution is intentionally needed (e.g. so the runner can find `claude` installed via a shell profile), consider documenting this expectation, or have the runner resolve its own PATH instead.
How can I resolve this? If you propose a fix, please make it concise.
scripts/claude-rescue-runner.sh
Outdated
| cat >"$PROMPT_FILE" <<EOF | ||
| An OpenClaw rescue watchdog detected a runtime failure. | ||
|
|
||
| Goals: | ||
| - Restore reply capability with the smallest safe fix. | ||
| - Prefer narrow changes over refactors. | ||
| - Do not commit, push, or modify files outside this workspace. | ||
| - Use the captured incident artifacts before exploring broadly. | ||
|
|
||
| Workspace: | ||
| - Repo: $WORKSPACE_DIR | ||
| - Incident dir: $INCIDENT_DIR | ||
|
|
||
| Artifacts: | ||
| - Summary: $SUMMARY_FILE | ||
| - Health snapshot: $HEALTH_FILE | ||
| - CLI status: $STATUS_FILE | ||
| - Gateway log tail: $LOG_FILE | ||
|
|
||
| Required steps: | ||
| 1. Read the incident artifacts first. | ||
| 2. Inspect only the code paths suggested by the incident. | ||
| 3. If you find a safe fix, apply it in this workspace. | ||
| 4. Run focused verification. Prefer pnpm build; if that is too expensive, explain why and run a narrower check. | ||
| 5. If possible, run openclaw health --json --timeout 10000 after the fix. | ||
| 6. Print a concise operator summary with: | ||
| - root cause | ||
| - files changed | ||
| - verification run | ||
| - remaining risk | ||
|
|
||
| Incident summary: | ||
| $summary_text | ||
| EOF |
There was a problem hiding this comment.
Heredoc delimiter could be prematurely closed
The heredoc uses EOF as its delimiter, and $summary_text is expanded inside it. If an openclaw error message ever produces a line that is exactly the string EOF (e.g. a log line like ... error at EOF), bash will close the heredoc at that line, silently truncating the prompt written to $PROMPT_FILE.
Using a less likely delimiter prevents this:
| cat >"$PROMPT_FILE" <<EOF | |
| An OpenClaw rescue watchdog detected a runtime failure. | |
| Goals: | |
| - Restore reply capability with the smallest safe fix. | |
| - Prefer narrow changes over refactors. | |
| - Do not commit, push, or modify files outside this workspace. | |
| - Use the captured incident artifacts before exploring broadly. | |
| Workspace: | |
| - Repo: $WORKSPACE_DIR | |
| - Incident dir: $INCIDENT_DIR | |
| Artifacts: | |
| - Summary: $SUMMARY_FILE | |
| - Health snapshot: $HEALTH_FILE | |
| - CLI status: $STATUS_FILE | |
| - Gateway log tail: $LOG_FILE | |
| Required steps: | |
| 1. Read the incident artifacts first. | |
| 2. Inspect only the code paths suggested by the incident. | |
| 3. If you find a safe fix, apply it in this workspace. | |
| 4. Run focused verification. Prefer pnpm build; if that is too expensive, explain why and run a narrower check. | |
| 5. If possible, run openclaw health --json --timeout 10000 after the fix. | |
| 6. Print a concise operator summary with: | |
| - root cause | |
| - files changed | |
| - verification run | |
| - remaining risk | |
| Incident summary: | |
| $summary_text | |
| EOF | |
| cat >"$PROMPT_FILE" <<'CLAUDE_PROMPT_EOF' | |
| An OpenClaw rescue watchdog detected a runtime failure. | |
| Goals: | |
| - Restore reply capability with the smallest safe fix. | |
| - Prefer narrow changes over refactors. | |
| - Do not commit, push, or modify files outside this workspace. | |
| - Use the captured incident artifacts before exploring broadly. | |
| Workspace: | |
| - Repo: WORKSPACE_DIR_PLACEHOLDER | |
| - Incident dir: INCIDENT_DIR_PLACEHOLDER |
Note: switching to a quoted (<<'...') heredoc disables variable expansion. A safer approach that keeps expansion is to use a unique, non-guessable delimiter:
cat >"$PROMPT_FILE" <<RESCUE_WATCHDOG_PROMPT_END
...
$summary_text
RESCUE_WATCHDOG_PROMPT_ENDThis makes accidental closure essentially impossible while keeping all $variable substitutions working.
Prompt To Fix With AI
This is a comment left during a code review.
Path: scripts/claude-rescue-runner.sh
Line: 34-67
Comment:
**Heredoc delimiter could be prematurely closed**
The heredoc uses `EOF` as its delimiter, and `$summary_text` is expanded inside it. If an `openclaw` error message ever produces a line that is exactly the string `EOF` (e.g. a log line like `... error at EOF`), bash will close the heredoc at that line, silently truncating the prompt written to `$PROMPT_FILE`.
Using a less likely delimiter prevents this:
```suggestion
cat >"$PROMPT_FILE" <<'CLAUDE_PROMPT_EOF'
An OpenClaw rescue watchdog detected a runtime failure.
Goals:
- Restore reply capability with the smallest safe fix.
- Prefer narrow changes over refactors.
- Do not commit, push, or modify files outside this workspace.
- Use the captured incident artifacts before exploring broadly.
Workspace:
- Repo: WORKSPACE_DIR_PLACEHOLDER
- Incident dir: INCIDENT_DIR_PLACEHOLDER
```
Note: switching to a quoted (`<<'...'`) heredoc disables variable expansion. A safer approach that keeps expansion is to use a unique, non-guessable delimiter:
```bash
cat >"$PROMPT_FILE" <<RESCUE_WATCHDOG_PROMPT_END
...
$summary_text
RESCUE_WATCHDOG_PROMPT_END
```
This makes accidental closure essentially impossible while keeping all `$variable` substitutions working.
How can I resolve this? If you propose a fix, please make it concise.|
Addressed the Greptile feedback in d8b47c5.
Re-checked both scripts with |
|
Added a follow-up fix for the failing merge check in This adds explicit exported helper types so the generated declaration surface no longer depends on inferred Local verification on this branch:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8e1544c663
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if [ -z "$incident_reason" ]; then | ||
| echo "openclaw rescue watchdog: healthy" | ||
| exit 0 |
There was a problem hiding this comment.
Clear suppression state after a healthy cycle
When incident_reason is empty, this branch exits immediately without updating state.env, so LAST_FINGERPRINT and LAST_TRIGGERED_AT from a prior outage remain active. In a fail → recover → fail sequence within OPENCLAW_RESCUE_COOLDOWN_SEC, the second outage is incorrectly treated as a duplicate and skips incident capture/runner execution/notification even though it is a new event after recovery.
Useful? React with 👍 / 👎.
|
Updated the branch to current Local verification on this branch:
The watchdog changes are unchanged; this follow-up only touches test typing/inference on top of the current base. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 22b231e30c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| incident_id="$(date -u +"%Y%m%dT%H%M%SZ")" | ||
| incident_dir="$INCIDENT_ROOT/$incident_id" | ||
| mkdir -p "$incident_dir" |
There was a problem hiding this comment.
Generate unique incident IDs per run
When two watchdog invocations detect an incident within the same second (for example, an overlapping cron run plus a manual run), both derive the same incident_id and write into the same directory because the ID is second-granularity and mkdir -p reuses existing paths. Under set -e, this can cause one run to fail during file writes/copies and leave a mixed or partial incident bundle instead of two independent captures.
Useful? React with 👍 / 👎.
|
Synced this branch with the latest upstream main in commit fc6cfc7 and resolved the merge conflicts in the previously added test-only type fixes.\n\nLocal verification on 2026-03-14:\n- pnpm install --frozen-lockfile\n- pnpm tsgo\n- pnpm exec vitest run --config vitest.unit.config.ts src/plugins/install.test.ts src/hooks/install.test.ts src/gateway/server-runtime-config.test.ts src/gateway/server.talk-config.test.ts src/telegram/network-config.test.ts src/infra/outbound/targets.test.ts src/infra/provider-usage.auth.normalizes-keys.test.ts\n- pnpm check\n\nThis also aligns the npm install assertion helper with the current install command shape after the recent upstream changes. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fc6cfc7c24
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| fi | ||
| printf '%s\n' "$runner_status" >"$incident_dir/runner.status" | ||
|
|
||
| write_state "$fingerprint" "$now_epoch" "$incident_id" |
There was a problem hiding this comment.
Persist suppression state before invoking runner
The watchdog writes state.env only after RUNNER_CMD returns, which leaves the latest fingerprint/cooldown invisible during long rescue runs. If another invocation starts in that window (for example, cron overlap or a manual rerun), it will treat the same outage as new and run a second incident capture/runner cycle, producing duplicate remediation attempts and notifications against the same failure. Save suppression state before launching the runner (or guard with a lock) so overlapping executions are deduplicated immediately.
Useful? React with 👍 / 👎.
|
Split the inherited CI fixes out into separate PRs so this feature PR can stay scoped:
Those failures were inherited from |
Summary
Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
User-visible / Behavior Changes
scripts/rescue-watchdog.shto detectopenclaw health --jsonfailures and channellastErrorsnapshots, capture incident bundles, de-duplicate repeats, optionally notify, and optionally invoke a rescue runner.scripts/claude-rescue-runner.shas a minimal Claude Code handoff runner for captured incidents.scripts/systemd/openclaw-rescue-watchdog.{service,timer}templates.docs/automation/rescue-watchdog.mdand linked it from auth monitoring docs.Security Impact (required)
Yes)No)Yes)Yes)Yes)Yes, explain risk + mitigation:The watchdog and runner are optional, external scripts. They execute only when explicitly installed/configured, default to no notification target and no runner, and keep the repair path outside the gateway process. The Claude runner is conservative by default (
acceptEdits) and does not commit or push changes.Repro + Verification
Environment
CLAUDE_BIN=echoOPENCLAW_BIN=falseSteps
bash -n scripts/rescue-watchdog.shandbash -n scripts/claude-rescue-runner.sh.OPENCLAW_BIN=false OPENCLAW_RESCUE_STATE_DIR="$PWD/.artifacts/rescue-watchdog-test" ./scripts/rescue-watchdog.sh.OPENCLAW_RESCUE_INCIDENT_DIR=... OPENCLAW_RESCUE_WORKSPACE_DIR="$PWD" CLAUDE_BIN=echo ./scripts/claude-rescue-runner.shto validate prompt/output generation.Expected
claude-prompt.txtandclaude-output.txt.Actual
Evidence
Human Verification (required)
What you personally verified (not just CI), and how:
healthcommand failure path, repeated identical incident suppression, runner not configured path.Review Conversations
Compatibility / Migration
Yes)Yes)No)Set the optional watchdog/runner environment variables and enable the provided systemd timer if you want the rescue flow.
Failure Recovery (if this breaks)
scripts/systemd/openclaw-rescue-watchdog.{service,timer}from the user systemd dir and unsetOPENCLAW_RESCUE_*/CLAUDE_RESCUE_*env vars.claudeis unavailable on PATH.Risks and Mitigations
Mitigation: the docs explicitly describe the flow as optional and external.
Mitigation: the sample runner defaults to
acceptEdits, does not commit/push, and is fully configurable via env vars.