Cron scheduler auto-dispatch stalls after SIGUSR1 restart — jobs never fire despite scheduler reporting 'started' #11013

@Proxify

Bug Description

After a Gateway SIGUSR1 restart (triggered by a config change), the cron scheduler stops dispatching jobs automatically. The scheduler reports itself as "started" and advances nextWakeAtMs correctly, but never actually spawns isolated sessions to execute due jobs.

Environment

  • OpenClaw version: 2026.2.3-1
  • OS: WSL2 Ubuntu 24.04 on Windows
  • Node.js: v22.22.0
  • Runtime: systemd user service (openclaw-gateway.service)

Steps to Reproduce

  1. Have multiple recurring cron jobs configured (mix of every and cron schedules, all sessionTarget: "isolated")
  2. Trigger a Gateway SIGUSR1 restart (e.g., via gateway(action="restart") tool call or config change)
  3. Observe that cron jobs stop firing after the restart

Expected Behavior

After a SIGUSR1 restart, the cron scheduler should re-arm all timers and dispatch jobs when their nextRunAtMs is reached.

Actual Behavior

  • cron(action="status") reports { enabled: true, jobs: 11, nextWakeAtMs: <future timestamp> } — looks healthy
  • nextWakeAtMs advances past job fire times — the scheduler tick is running
  • But zero isolated sessions are spawned — no cron:<jobId> sessions appear in sessions_list
  • cron(action="runs") shows no new entries after the restart
  • No dispatch, spawn, agentTurn, or isolated log entries appear in Gateway logs
  • Gateway logs only show cron: started at boot — no subsequent dispatch activity

Attempted Fixes (all failed to restore auto-dispatch)

  1. SIGUSR1 restart via gateway(action="restart") — scheduler re-initializes but still doesn't dispatch
  2. Full process restart via systemctl --user restart openclaw-gateway.service — new PID, scheduler says "started", still doesn't dispatch
  3. cron(action="wake", mode="now") — returns { ok: true } but no jobs fire
  4. Manually resetting nextRunAtMs in jobs.json to past timestamps — scheduler recalculates forward from interval on restart, still doesn't dispatch
  5. Clearing lastRunAtMs and lastStatus from all job state — same result

Workaround That Works

openclaw cron run <jobId> --force successfully executes jobs. The execution pipeline itself is fine — only the automatic timer-based dispatch is broken.

Current workaround: system crontab entries that call openclaw cron run <jobId> --force on each job's interval.
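For anyone applying the same workaround, the crontab lines can be generated instead of written by hand. This is a minimal sketch, assuming each job runs on a fixed interval in whole minutes that either divides 60 or is a whole number of hours dividing 24 (which covers the 5m, 15m, 60m, 180m and 360m jobs here). The job IDs are placeholders, not real IDs from this install, and `cron_expr` / `crontab_line` are hypothetical helper names.

```python
def cron_expr(interval_min: int) -> str:
    """Map a fixed interval in minutes to a five-field crontab schedule."""
    if interval_min < 1:
        raise ValueError("interval must be at least 1 minute")
    if interval_min < 60:
        # */N in the minute field is only evenly spaced when N divides 60.
        if 60 % interval_min != 0:
            raise ValueError(f"{interval_min}m does not divide 60 evenly")
        return f"*/{interval_min} * * * *"
    hours, rem = divmod(interval_min, 60)
    if rem != 0 or hours > 24 or 24 % hours != 0:
        raise ValueError(f"{interval_min}m is not a whole number of hours dividing 24")
    if hours == 1:
        return "0 * * * *"
    if hours == 24:
        return "0 0 * * *"
    return f"0 */{hours} * * *"


def crontab_line(job_id: str, interval_min: int) -> str:
    """Build one crontab entry that force-runs a job via the CLI."""
    return f"{cron_expr(interval_min)} openclaw cron run {job_id} --force"


# Placeholder job IDs; substitute the real IDs from the job store.
for job_id, minutes in [("job-a", 5), ("job-b", 180)]:
    print(crontab_line(job_id, minutes))
```

Note that cron runs commands with a minimal PATH, so the entries may need the full path to the openclaw binary.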

Timeline

  • Feb 5, 2:58 PM CT: Last successful automatic cron dispatch (Kalshi scanner job)
  • Feb 5, 4:43 PM CT: Gateway SIGUSR1 restart triggered by config change (block streaming + TTS settings)
  • Feb 5, 4:43 PM CT → Feb 7, 2:30 AM CT: ~34 hours of zero automatic dispatches across all 11 jobs
  • Feb 6, 11:47 PM CT: Another session detected the stall and attempted SIGUSR1 restart — didn't fix it
  • Feb 7, 2:04 AM CT: Full systemctl restart — didn't fix auto-dispatch
  • Feb 7, 2:30 AM CT: openclaw cron run --force confirmed jobs can execute manually

Key Observation

The wakeMode on all jobs is "next-heartbeat". The heartbeat subsystem IS running (5 min interval, confirmed in logs). But the cron dispatcher doesn't appear to check for due jobs during heartbeat ticks — or the check silently fails/skips.

Relevant Log Entries

# Cron says "started" after restart but never dispatches:
{"module":"cron"} {"enabled":true,"jobs":11,"nextWakeAtMs":1770452046127} "cron: started"

# Heartbeat running fine:
{"subsystem":"gateway/heartbeat"} {"intervalMs":300000} "heartbeat: started"

# Zero entries matching any of: dispatch, spawn, agentTurn, isolated
grep -c "dispatch\|spawn\|agentTurn\|isolated" /tmp/openclaw/openclaw-2026-02-07.log
# Result: 0
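
The same check can be broken down per keyword, which helps confirm that none of the terms appear at all rather than just that the combined count is zero. This is a rough sketch that mirrors the grep above (case-sensitive, counts lines rather than occurrences). The sample lines are the two log entries quoted in this report, and `count_matching_lines` is a hypothetical helper name.

```python
import re
from collections import Counter

KEYWORDS = ("dispatch", "spawn", "agentTurn", "isolated")
PATTERN = re.compile("|".join(map(re.escape, KEYWORDS)))


def count_matching_lines(lines):
    """Return (number of lines with any keyword, per-keyword line counts)."""
    total = 0
    per_keyword = Counter()
    for line in lines:
        found = set(PATTERN.findall(line))
        if found:
            total += 1
            # Each keyword is counted at most once per line.
            per_keyword.update(found)
    return total, per_keyword


sample = [
    '{"module":"cron"} {"enabled":true,"jobs":11,"nextWakeAtMs":1770452046127} "cron: started"',
    '{"subsystem":"gateway/heartbeat"} {"intervalMs":300000} "heartbeat: started"',
]
total, per_keyword = count_matching_lines(sample)
print(total, dict(per_keyword))
```

To run it against the real log, pass an open file object instead of `sample`, since the function accepts any iterable of lines.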

Job Store Path

~/.openclaw/cron/jobs.json — 11 jobs, all enabled: true, mix of schedule.kind: "every" (5m, 15m, 60m, 180m, 360m) and schedule.kind: "cron".

Suggested Investigation

The scheduler tick advances nextWakeAtMs but the dispatch codepath that should compare nextRunAtMs against Date.now() and spawn isolated sessions appears to be silently broken after SIGUSR1. Possibly a stale reference to the old scheduler instance, or the dispatch callback isn't re-registered on the new timer.
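
To make the suspected gap concrete, here is a minimal sketch of what a healthy tick would be expected to do. It is not OpenClaw's actual implementation: `Job`, `tick` and the `dispatch` callback are hypothetical names, and only fixed-interval jobs are modelled. The point is that dispatching and advancing the next run time happen together; a tick that only advances the timestamps would produce exactly the symptom described above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Job:
    job_id: str
    interval_ms: int
    next_run_at_ms: int
    enabled: bool = True


def tick(jobs: List[Job], now_ms: int, dispatch: Callable[[Job], None]) -> Optional[int]:
    """Dispatch every due job, advance it, and return the next wake time in ms."""
    for job in jobs:
        if not job.enabled:
            continue
        if job.next_run_at_ms <= now_ms:
            # If this call is skipped (for example because the callback still
            # points at a torn-down scheduler), timestamps advance but nothing runs.
            dispatch(job)
            # Skip any missed intervals so the next run is strictly in the future.
            missed = (now_ms - job.next_run_at_ms) // job.interval_ms + 1
            job.next_run_at_ms += missed * job.interval_ms
    upcoming = [j.next_run_at_ms for j in jobs if j.enabled]
    return min(upcoming) if upcoming else None


fired = []
jobs = [Job("a", 300_000, 1_000), Job("b", 900_000, 2_000_000)]
next_wake = tick(jobs, now_ms=1_500, dispatch=lambda j: fired.append(j.job_id))
print(fired, next_wake)
```

If the real dispatcher has a similar shape, logging at the point where `dispatch` would be called should show whether the due check is reached after a restart.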
