Summary
Cron jobs stop executing for long periods (hours or more) even though the Gateway process remains up and interactive commands may still work. The internal cron lane appears to wedge behind a job that never logs a `lane task done: lane=cron` completion.
This causes all scheduled digests/alerts to silently stop, and any cron-based watchdogs also stop (same failure domain). Manual gateway restart clears the condition.
Environment
- Host: macOS (arm64)
- OpenClaw: 2026.2.3-1
- Gateway mode: local (ws://127.0.0.1:18789)
What I observe
- Regular cron jobs (e.g. `telegram-health-check`, `gateway-rpc-health-check`, `notification-audit`, etc.) run normally for a while.
- Then at some point, a cron session starts processing and the system never logs `lane task done: lane=cron` afterwards.
- After that moment, no cron sessions produce transcript activity for many hours (verified by `~/.openclaw/agents/main/sessions/*.jsonl` mtimes and a local watcher state file).
- Gateway restart restores cron execution.
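For reference, the "verified by mtimes" check above is just a filesystem probe. A minimal sketch (the sessions path matches my host, and the 30-minute window is an arbitrary threshold, not an OpenClaw convention):

```shell
# Hypothetical staleness probe: succeed iff some cron session transcript
# was modified within the last N minutes.
cron_fresh() {
  dir="${1:-$HOME/.openclaw/agents/main/sessions}"
  mins="${2:-30}"
  # -mmin is supported by both BSD (macOS) and GNU find
  find "$dir" -name '*.jsonl' -mmin "-$mins" 2>/dev/null | grep -q .
}

if cron_fresh; then echo "cron activity: fresh"; else echo "cron activity: STALE"; fi
```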
Evidence (log excerpts)
Captured from an incident bundle created on the host:
Last known activity
From my local state file (`cron-scheduler-watch.json`):
`lastSeenActivityEpoch: 1770309855` (Feb 5, 2026, ~08:44 PT)
Cron lane dequeues a job and then stops completing
From `incident-20260205-084413.log` around `2026-02-05T16:44:00Z`:

```
lane enqueue: lane=session:agent:main:cron:5eb4a3da-... queueSize=1
lane dequeue: lane=session:agent:main:cron:5eb4a3da-...
lane enqueue: lane=cron queueSize=1
lane dequeue: lane=cron ...
embedded run start: runId=acc80e6c-... sessionId=acc80e6c-...
```
After that point, there is no subsequent `lane task done: lane=cron` entry in that capture.
(For context, earlier in the same capture there are multiple successful `lane task done: lane=cron` lines, e.g. at 16:30, 16:32, 16:34, 16:35, and 16:42Z.)
Impact
- Missed scheduled digests and alerting (e.g. daily suggestions digest, Reddit digest, etc.)
- Cron-based watchdogs cannot detect/recover because they also stop when cron is wedged.
Workaround
I installed an out-of-band watchdog (macOS LaunchAgent) that runs every 5 minutes:
- Checks for recent cron transcript activity.
- If stale, captures an incident bundle and mutex-restarts the gateway.
This reduces user-visible downtime but is obviously not ideal.
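The watchdog body can be sketched roughly as below. All paths and the restart/bundle commands are placeholders from my setup, not OpenClaw APIs; the LaunchAgent just invokes this script on a 5-minute `StartInterval`:

```shell
# Hypothetical watchdog payload (run every 5 minutes by a LaunchAgent).
SESSIONS_DIR="${SESSIONS_DIR:-$HOME/.openclaw/agents/main/sessions}"
LOCK_DIR="${LOCK_DIR:-${TMPDIR:-/tmp}/openclaw-watchdog.lock}"

watchdog() {
  # a transcript touched in the last 30 min means cron is alive; do nothing
  find "$SESSIONS_DIR" -name '*.jsonl' -mmin -30 2>/dev/null | grep -q . && return 0
  # mkdir is atomic, so it doubles as a mutex against overlapping runs
  mkdir "$LOCK_DIR" 2>/dev/null || return 0
  echo "stale: capturing bundle and restarting gateway"
  # tar czf "$HOME/incident-$(date +%Y%m%d-%H%M%S).tgz" "$HOME/.openclaw/logs"
  # <restart command for your gateway here>   # placeholder, setup-specific
  rmdir "$LOCK_DIR"
}

watchdog
```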
Request
- Any known issue/duplicate for cron lane wedging specifically?
- Can cron execution be protected with per-job timeouts / cancellation so one stuck job cannot wedge the entire scheduler?
- Any recommended debug flags to capture where the cron worker is blocked (e.g. a stuck `exec`, a stuck gateway admin request, or a stuck model call)?
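To illustrate the per-job timeout idea from the second request, here is a rough POSIX-sh sketch of the kill-after-N-seconds pattern. This is purely illustrative of the semantics I'm asking for; the real fix would presumably live inside the gateway's lane worker, not in shell:

```shell
# Run "$@" with a hard cap of $1 seconds; kill the job if it overruns,
# so one wedged job cannot block everything queued behind it.
run_with_timeout() {
  secs="$1"; shift
  "$@" &                                        # start the job
  cmd_pid=$!
  ( sleep "$secs"; kill -TERM "$cmd_pid" 2>/dev/null ) &
  watchdog_pid=$!
  wait "$cmd_pid"                               # job's status, or 128+signal if killed
  status=$?
  kill "$watchdog_pid" 2>/dev/null              # cancel the timer if the job finished
  return "$status"
}

run_with_timeout 1 sleep 10                     # wedged job: killed after ~1s
echo "exit status: $?"                          # nonzero when the job was killed
```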
If maintainers want, I can attach the full incident bundle(s) (they contain redacted config summaries and gateway logs).