Skip to content

Heartbeat scheduler dies silently when runOnce() throws during session compaction #14892

@joeykrug

Description

@joeykrug

Bug Description

The heartbeat scheduler's run() function in startHeartbeatRunner() has no try/catch around the runOnce() call. If runOnce() (which calls getReplyFromConfig) throws an unhandled exception — which appears to happen when the heartbeat session compacts — the scheduleNext() call at the end of run() is never reached. The timer is never rescheduled, and heartbeats silently stop forever until the gateway is restarted.

Steps to Reproduce

  1. Configure heartbeat with every: 60m
  2. Let the heartbeat session accumulate context over many runs
  3. Wait for the heartbeat session to hit compaction threshold
  4. After compaction, heartbeats never fire again

Evidence from Logs

# Heartbeats running normally every ~60m:
Feb 11 07:03  messageChannel=heartbeat
Feb 11 07:20  messageChannel=heartbeat
Feb 11 08:20  messageChannel=heartbeat  ← last one before session compacted

# 34 hours of silence — no heartbeats, no errors logged

# Only fixed by gateway restart:
Feb 12 18:53  [heartbeat] started

Root Cause

In health-format-*.js, the run() function inside startHeartbeatRunner():

const run = async (params) => {
    // ...
    for (const agent of state.agents.values()) {
        // No try/catch here:
        const res = await runOnce({ ... });
        // If runOnce throws, we never reach:
        // - agent.lastRunMs = now
        // - agent.nextDueMs = now + agent.intervalMs
    }
    scheduleNext();  // Never called if runOnce throws
};

Also: the early return for requests-in-flight skips scheduleNext(), which could also strand the timer in edge cases.

Suggested Fix

for (const agent of state.agents.values()) {
    if (isInterval && now < agent.nextDueMs) continue;
    let res;
    try {
        res = await runOnce({ ... });
    } catch (runErr) {
        log.error(\`heartbeat runner: runOnce threw: \${runErr?.message ?? runErr}\`);
        agent.lastRunMs = now;
        agent.nextDueMs = now + agent.intervalMs;
        continue;
    }
    if (res.status === 'skipped' && res.reason === 'requests-in-flight') {
        scheduleNext();  // Don't forget to reschedule before returning
        return res;
    }
    // ... rest unchanged
}
scheduleNext();

Workaround

Applied the above patch locally to dist/health-format-*.js. Also set up a watchdog cron that restarts the gateway if no heartbeats fire for 2+ hours.

Environment

  • OpenClaw 2026.2.6-3
  • Model: claude-opus-4-6 with 1M context window (compaction thresholds set to 200k)
  • Heartbeat model: claude-sonnet-4-20250514
  • OS: Linux 6.12.67 (x64)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions