Bug Description
The heartbeat scheduler's run() function in startHeartbeatRunner() has no try/catch around the runOnce() call. If runOnce() (which calls getReplyFromConfig) throws an unhandled exception — which appears to happen when the heartbeat session compacts — the scheduleNext() call at the end of run() is never reached. The timer is never rescheduled, and heartbeats silently stop forever until the gateway is restarted.
Steps to Reproduce
- Configure heartbeat with every: 60m
- Let the heartbeat session accumulate context over many runs
- Wait for the heartbeat session to hit compaction threshold
- After compaction, heartbeats never fire again
Evidence from Logs
# Heartbeats running normally every ~60m:
Feb 11 07:03 messageChannel=heartbeat
Feb 11 07:20 messageChannel=heartbeat
Feb 11 08:20 messageChannel=heartbeat ← last one before session compacted
# 34 hours of silence — no heartbeats, no errors logged
# Only fixed by gateway restart:
Feb 12 18:53 [heartbeat] started
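The size of the silent gap can be checked directly from the timestamps in the excerpt above. This is only a rough sketch: findGaps is a hypothetical helper, and the year (2026) and the use of UTC are assumptions, since the log lines carry neither. The 2-hour threshold matches the watchdog described under Workaround.

```javascript
const HOUR_MS = 60 * 60 * 1000;

// Return every pair of consecutive timestamps separated by more than maxGapMs.
function findGaps(timestampsMs, maxGapMs) {
  const gaps = [];
  for (let i = 1; i < timestampsMs.length; i += 1) {
    const gapMs = timestampsMs[i] - timestampsMs[i - 1];
    if (gapMs > maxGapMs) {
      gaps.push({ from: timestampsMs[i - 1], to: timestampsMs[i], gapMs });
    }
  }
  return gaps;
}

// Timestamps copied from the log excerpt. Year and time zone are assumed.
const seen = [
  Date.UTC(2026, 1, 11, 7, 3),
  Date.UTC(2026, 1, 11, 7, 20),
  Date.UTC(2026, 1, 11, 8, 20), // last heartbeat before the silence
  Date.UTC(2026, 1, 12, 18, 53), // gateway restart
];

const gaps = findGaps(seen, 2 * HOUR_MS);
for (const gap of gaps) {
  console.log(`gap of ${Math.floor(gap.gapMs / HOUR_MS)}h starting ${new Date(gap.from).toISOString()}`);
}
```

With these inputs only the 08:20 to 18:53 interval exceeds the threshold, which is the 34-hour silence noted in the log.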
Root Cause
In health-format-*.js, the run() function inside startHeartbeatRunner():
const run = async (params) => {
  // ...
  for (const agent of state.agents.values()) {
    // No try/catch here:
    const res = await runOnce({ ... });
    // If runOnce throws, we never reach:
    // - agent.lastRunMs = now
    // - agent.nextDueMs = now + agent.intervalMs
  }
  scheduleNext(); // Never called if runOnce throws
};
Also: the early return for requests-in-flight skips scheduleNext(), which could also strand the timer in edge cases.
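The failure mode can be shown in isolation. The sketch below is a minimal stand-in, not the actual OpenClaw code: makeFragileRunner, the scheduled counter, and the simulated error are all hypothetical, and scheduleNext only counts calls instead of arming a real timer.

```javascript
// Hypothetical stand-in for the runner. Only the control flow mirrors the issue.
function makeFragileRunner(runOnce) {
  const state = { scheduled: 0 };
  const scheduleNext = () => { state.scheduled += 1; };
  const run = async () => {
    await runOnce(); // a rejection here propagates straight out of run()
    scheduleNext();  // skipped whenever runOnce rejects
  };
  return { run, state };
}

async function demo() {
  const healthy = makeFragileRunner(async () => ({ status: 'ran' }));
  await healthy.run();

  const broken = makeFragileRunner(async () => {
    throw new Error('simulated failure after compaction');
  });
  // The rejection is swallowed here only so the demo can keep going.
  await broken.run().catch(() => {});

  return { healthy: healthy.state.scheduled, broken: broken.state.scheduled };
}

demo().then((counts) => console.log(counts));
```

The healthy runner reschedules once, while the runner whose runOnce throws never reschedules. That matches the observed behavior of heartbeats stopping until a restart.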
Suggested Fix
for (const agent of state.agents.values()) {
  if (isInterval && now < agent.nextDueMs) continue;
  let res;
  try {
    res = await runOnce({ ... });
  } catch (runErr) {
    log.error(`heartbeat runner: runOnce threw: ${runErr?.message ?? runErr}`);
    agent.lastRunMs = now;
    agent.nextDueMs = now + agent.intervalMs;
    continue;
  }
  if (res.status === 'skipped' && res.reason === 'requests-in-flight') {
    scheduleNext(); // Don't forget to reschedule before returning
    return res;
  }
  // ... rest unchanged
}
scheduleNext();
Workaround
Applied the above patch locally to dist/health-format-*.js. Also set up a watchdog cron that restarts the gateway if no heartbeats fire for 2+ hours.
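A possible variant of the suggested fix wraps the whole loop in try/finally, so that rescheduling does not depend on remembering to call scheduleNext() on each exit path. This is a sketch under assumptions, not the project's code: makeSafeRunner, the injected runOnce and scheduleNext, the agent shape, and the { status: 'ran' } return value are hypothetical, and the success-path bookkeeping is inferred from the comments in the Root Cause snippet.

```javascript
// Sketch only: every dependency is injected so the control flow can be tested alone.
function makeSafeRunner({ agents, runOnce, scheduleNext, log = console }) {
  return async function run(now = Date.now()) {
    try {
      for (const agent of agents) {
        let res;
        try {
          res = await runOnce(agent);
        } catch (err) {
          log.error(`heartbeat runner: runOnce threw: ${err?.message ?? err}`);
          // Advance the due time so a failing agent is retried next interval.
          agent.lastRunMs = now;
          agent.nextDueMs = now + agent.intervalMs;
          continue;
        }
        if (res.status === 'skipped' && res.reason === 'requests-in-flight') {
          return res; // the finally block below still reschedules
        }
        agent.lastRunMs = now;
        agent.nextDueMs = now + agent.intervalMs;
      }
      return { status: 'ran' }; // hypothetical return value
    } finally {
      // Runs on normal completion, on early return, and on an unexpected throw.
      scheduleNext();
    }
  };
}
```

The inner try/catch keeps one failing agent from blocking the rest of the loop. The outer finally is the safety net that keeps the timer alive whatever happens inside.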
Environment
- OpenClaw 2026.2.6-3
- Model: claude-opus-4-6 with 1M context window (compaction thresholds set to 200k)
- Heartbeat model: claude-sonnet-4-20250514
- OS: Linux 6.12.67 (x64)