[Bug] Cron scheduler lane wedges (jobs stop running for hours) while gateway remains responsive; restart clears #10904

@cmfinlan

Description

Summary

Cron jobs stop executing for long periods (hours or more) even though the Gateway process remains up and interactive commands may still work. The internal cron lane appears to wedge behind a job that never logs a `lane task done: lane=cron` completion.

This silently stops all scheduled digests/alerts, and any cron-based watchdogs stop as well (same failure domain). A manual gateway restart clears the condition.

Environment

  • Host: macOS (arm64)
  • OpenClaw: 2026.2.3-1
  • Gateway mode: local (ws://127.0.0.1:18789)

What I observe

  • Regular cron jobs (e.g. `telegram-health-check`, `gateway-rpc-health-check`, `notification-audit`) run normally for a while.
  • Then at some point a cron session starts processing and the system never logs `lane task done: lane=cron` afterwards.
  • From that moment on, no cron session produces transcript activity for many hours (verified via `~/.openclaw/agents/main/sessions/*.jsonl` mtimes and a local watcher state file).
  • A gateway restart restores cron execution.
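For reproducibility, the transcript-staleness check above can be sketched as follows. The sessions path comes from this report; the 15-minute threshold and function names are illustrative assumptions, not OpenClaw tooling.

```python
# Minimal staleness probe: the cron lane is considered stale when no
# transcript under the sessions directory has been modified recently.
import glob
import os
import time

SESSIONS = os.path.expanduser("~/.openclaw/agents/main/sessions/*.jsonl")
STALE_AFTER = 15 * 60  # seconds; arbitrary example threshold

def newest_transcript_mtime(pattern=SESSIONS):
    """Return the newest mtime among matching transcripts (0.0 if none)."""
    return max((os.path.getmtime(p) for p in glob.glob(pattern)), default=0.0)

def cron_lane_is_stale(pattern=SESSIONS, now=None, stale_after=STALE_AFTER):
    """True if no transcript has been written within `stale_after` seconds."""
    now = time.time() if now is None else now
    return now - newest_transcript_mtime(pattern) > stale_after
```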

Evidence (log excerpts)

Captured from an incident bundle created on the host:

Last known activity

From my local state file (`cron-scheduler-watch.json`):

  • `lastSeenActivityEpoch`: 1770309855 (Feb 5, 2026, ~08:44 PT)

Cron lane dequeues a job and then stops completing

From `incident-20260205-084413.log`, around 2026-02-05T16:44:00Z:

  • `lane enqueue: lane=session:agent:main:cron:5eb4a3da-... queueSize=1`
  • `lane dequeue: lane=session:agent:main:cron:5eb4a3da-...`
  • `lane enqueue: lane=cron queueSize=1`
  • `lane dequeue: lane=cron ...`
  • `embedded run start: runId=acc80e6c-... sessionId=acc80e6c-...`

After that point, there is no subsequent `lane task done: lane=cron` entry in that capture.

(For context, the same capture contains multiple earlier successful `lane task done: lane=cron` lines, e.g. at 16:30, 16:32, 16:34, 16:35, and 16:42Z.)

Impact

  • Missed scheduled digests and alerting (e.g. the daily suggestions digest, the Reddit digest).
  • Cron-based watchdogs cannot detect or recover from the condition, because they stop running along with everything else when cron is wedged.

Workaround

I installed an out-of-band watchdog (a macOS LaunchAgent) that runs every 5 minutes:

  • It checks for recent cron transcript activity.
  • If the activity is stale, it captures an incident bundle and restarts the gateway under a lock (so overlapping watchdog runs cannot double-restart).

This reduces user-visible downtime but is obviously not ideal.
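For reference, the watchdog body looks roughly like the sketch below. The state-file name comes from this report; the lock path and the restart command are assumptions (substitute whatever restart mechanism you actually use), and `fcntl` makes this POSIX-only, which is fine for a LaunchAgent.

```python
# Rough sketch of the 5-minute watchdog body described above.
# NOTE: the restart command is a placeholder, not a real OpenClaw CLI call.
import fcntl
import json
import os
import subprocess
import time

STATE = os.path.expanduser("~/.openclaw/cron-scheduler-watch.json")
LOCK = "/tmp/openclaw-watchdog.lock"
STALE_AFTER = 15 * 60  # seconds; example threshold

def is_stale(state_path=STATE, now=None, stale_after=STALE_AFTER):
    """True if lastSeenActivityEpoch is older than stale_after (or unreadable)."""
    now = time.time() if now is None else now
    try:
        with open(state_path) as f:
            last = json.load(f).get("lastSeenActivityEpoch", 0)
    except (OSError, ValueError):
        return True  # a missing or corrupt state file counts as stale
    return now - last > stale_after

def mutex_restart():
    """Restart the gateway at most once across overlapping watchdog runs."""
    with open(LOCK, "w") as lock:
        try:
            # Non-blocking exclusive lock: the "mutex" in mutex-restart.
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False  # another run already holds the restart lock
        # Placeholder restart command -- replace with your actual one.
        subprocess.run(["openclaw", "gateway", "restart"], check=False)
        return True
```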

Request

  1. Is there a known issue or duplicate for cron-lane wedging specifically?
  2. Can cron execution be protected with per-job timeouts/cancellation, so that one stuck job cannot wedge the entire scheduler?
  3. Are there recommended debug flags to capture where the cron worker is blocked (e.g. a stuck exec, a stuck gateway admin request, or a stuck model call)?
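To illustrate what (2) is asking for, a per-job timeout guard could look roughly like the generic sketch below; this is not OpenClaw's scheduler. Note that a timed-out Python thread cannot be force-killed, so a real fix would also need cooperative cancellation inside the job itself.

```python
# Per-job deadline guard: a job that exceeds its budget is reported as
# timed-out and the lane moves on, instead of wedging behind it forever.
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as JobTimeout

def run_with_timeout(job, timeout_s):
    """Run `job` with a deadline; return (status, result_or_exception)."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(job)
    try:
        return ("done", future.result(timeout=timeout_s))
    except JobTimeout:
        # The worker thread may still be stuck, but the caller is unblocked.
        return ("timed_out", None)
    except Exception as exc:
        return ("failed", exc)
    finally:
        pool.shutdown(wait=False)  # never block the lane on a stuck worker
```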

If maintainers want, I can attach the full incident bundle(s) (they contain redacted config summaries and gateway logs).
