cron.list blocked by executeJob lock — read operations wait for agent turns to complete #10373

@dario-github

Bug Description

cron.list (and other cron read operations) are blocked for 12–421 seconds because they share a promise-chain lock with executeJob, which holds the lock while waiting for LLM agent turns to complete.

Evidence (gateway logs)

# Normal WS calls during the same period: 50-80ms
2026-02-06T09:19:43.731Z [ws] ⇄ res ✓ chat.history 62ms
2026-02-06T09:21:08.788Z [ws] ⇄ res ✓ chat.history 55ms

# cron.list calls: 12,000-421,000ms
2026-02-06T09:01:33.299Z [ws] ⇄ res ✓ cron.list 77920ms
2026-02-06T10:31:38.411Z [ws] ⇄ res ✓ cron.list 89381ms
2026-02-06T11:01:41.640Z [ws] ⇄ res ✓ cron.list 80715ms
2026-02-06T07:37:24.814Z [ws] ⇄ res ✓ cron.list 421122ms

# gateway.err.log: 15 consecutive timeouts
2026-02-05T20:01:10.313Z [tools] cron failed: gateway timeout after 60000ms
2026-02-06T03:01:12.619Z [tools] cron failed: gateway timeout after 60000ms
2026-02-06T05:01:05.948Z [tools] cron failed: gateway timeout after 60000ms
... (15 total)

All slow cron.list calls start just after :00 / :30 (response timestamp minus logged duration), exactly when heartbeat cron jobs fire.
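To check the :00 / :30 correlation, the request start time can be derived from each log line. This is a minimal sketch that assumes the logged timestamp marks when the response was sent (the lines are tagged `res`) and that the trailing `Nms` is the full request duration; `requestStart` is a hypothetical helper, not part of the gateway.

```typescript
// The four slow cron.list lines quoted in the evidence above.
const logLines: string[] = [
  "2026-02-06T09:01:33.299Z [ws] ⇄ res ✓ cron.list 77920ms",
  "2026-02-06T10:31:38.411Z [ws] ⇄ res ✓ cron.list 89381ms",
  "2026-02-06T11:01:41.640Z [ws] ⇄ res ✓ cron.list 80715ms",
  "2026-02-06T07:37:24.814Z [ws] ⇄ res ✓ cron.list 421122ms",
];

// Hypothetical helper: response timestamp minus duration = request start.
function requestStart(line: string): string {
  const match = /^(\S+) .* (\d+)ms$/.exec(line);
  if (!match) throw new Error(`unrecognized log line: ${line}`);
  const respondedAt = Date.parse(match[1]); // ISO 8601 timestamp at line start
  const durationMs = Number(match[2]);      // trailing "<N>ms"
  return new Date(respondedAt - durationMs).toISOString();
}

for (const line of logLines) {
  console.log(requestStart(line));
}
// 2026-02-06T09:00:15.379Z
// 2026-02-06T10:30:09.030Z
// 2026-02-06T11:00:20.925Z
// 2026-02-06T07:30:23.692Z
```

Under those assumptions, every slow call was issued within about 25 seconds of a :00 or :30 boundary.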

Root Cause Analysis

In src/cron/service/locked.ts, all cron operations share a single promise-chain lock:

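// Excerpt: storeLocks, storePath, resolveChain and keepAlive are defined in code not quoted here.
export async function locked<T>(state: CronServiceState, fn: () => Promise<T>): Promise<T> {
  const storeOp = storeLocks.get(storePath) ?? Promise.resolve();
  // fn only starts once both the per-service chain and the per-store chain have settled.
  const next = Promise.all([resolveChain(state.op), resolveChain(storeOp)]).then(fn);
  // Every later caller queues behind this operation.
  state.op = keepAlive;
  storeLocks.set(storePath, keepAlive);
  return (await next) as T;
}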

In src/cron/service/timer.ts, runDueJobs holds this lock while sequentially awaiting each agent turn:

// Inside onTimer() → locked() scope:
async function runDueJobs(state: CronServiceState) {
  for (const job of due) {
    await executeJob(state, job, now, { forced: false }); // blocks lock for 90s+ per job
  }
}

And executeJob awaits the full LLM agent turn inside the lock:

const res = await state.deps.runIsolatedAgentJob({...}); // 60-300s for LLM completion

Since cron.list also acquires the same lock:

export async function list(state, opts?) {
  return await locked(state, async () => { // ← blocked until executeJob releases
    await ensureLoaded(state);
    return jobs.toSorted(...);
  });
}

Result: A simple read-only list operation waits for ALL due jobs to finish executing sequentially.
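The blocking can be reproduced in isolation. The following is a minimal sketch, not the OpenClaw code: `ToyState`, `sleep` and `demoBlockedRead` are hypothetical names, the per-store lock map is left out, and two 100ms sleeps stand in for two agent turns.

```typescript
// Hypothetical stand-in for CronServiceState: only the lock chain matters here.
interface ToyState {
  op: Promise<unknown>;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Swallow the outcome so a failed operation does not poison the chain.
const resolveChain = (p: Promise<unknown>) => p.then(() => undefined, () => undefined);

// Simplified promise-chain lock: each caller runs after the previous one settles.
async function locked<T>(state: ToyState, fn: () => Promise<T>): Promise<T> {
  const next = resolveChain(state.op).then(fn);
  state.op = resolveChain(next);
  return await next;
}

// Returns how long a read-only operation waited behind a simulated timer tick.
async function demoBlockedRead(): Promise<number> {
  const state: ToyState = { op: Promise.resolve() };

  // Simulated onTimer(): two "agent turns" awaited sequentially inside the lock.
  const tick = locked(state, async () => {
    for (let i = 0; i < 2; i++) await sleep(100);
  });

  // Simulated cron.list: trivial work, but queued behind the tick.
  const startedAt = Date.now();
  await locked(state, async () => ["job-a", "job-b"]);
  const waitedMs = Date.now() - startedAt;

  await tick;
  return waitedMs;
}

demoBlockedRead().then((ms) => console.log(`list waited ${ms}ms`));
```

The read returns only after both simulated turns finish (roughly 200ms here), even though it does no slow work itself.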

Impact

  • Agent tool calls to cron.list time out at the 60s default
  • cron.update also blocked (observed 29s delay)
  • With multiple jobs due simultaneously, lock hold time compounds (421s observed, consistent with ~4 jobs at roughly 105s each)
  • Any cron management during active job execution is effectively impossible

Suggested Fix

Separate the execution lock from the state-read lock. Options:

  1. Read-write lock: cron.list / cron.status take a read lock that doesn't block on execution
  2. Execute outside lock: executeJob should release the lock before awaiting runIsolatedAgentJob, then re-acquire it afterwards to update state
  3. Snapshot for reads: cron.list reads a snapshot of the job array without acquiring the execution lock
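As an illustration of option 2, here is a rough sketch under the same simplified lock model. It is not a patch against the real code: `ToyJob`, `ToyCronState`, `executeJobUnlocked`, the `running` flag and `demoUnblockedRead` are all hypothetical, and `runAgentTurn` stands in for `runIsolatedAgentJob`.

```typescript
// Hypothetical, simplified job and state shapes for illustration only.
interface ToyJob {
  id: string;
  running: boolean;
  lastStatus?: "ok" | "error";
}
interface ToyCronState {
  op: Promise<unknown>;
  jobs: ToyJob[];
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));
const resolveChain = (p: Promise<unknown>) => p.then(() => undefined, () => undefined);

async function locked<T>(state: ToyCronState, fn: () => Promise<T>): Promise<T> {
  const next = resolveChain(state.op).then(fn);
  state.op = resolveChain(next);
  return await next;
}

async function executeJobUnlocked(
  state: ToyCronState,
  jobId: string,
  runAgentTurn: () => Promise<void>,
): Promise<void> {
  // Phase 1 (locked, short): claim the job so a concurrent tick cannot start it twice.
  const claimed = await locked(state, async () => {
    const job = state.jobs.find((j) => j.id === jobId);
    if (!job || job.running) return false;
    job.running = true;
    return true;
  });
  if (!claimed) return;

  // Phase 2 (unlocked, slow): the agent turn no longer holds up other operations.
  let status: "ok" | "error" = "ok";
  try {
    await runAgentTurn();
  } catch {
    status = "error";
  }

  // Phase 3 (locked, short): look the job up again, since it may have changed meanwhile.
  await locked(state, async () => {
    const job = state.jobs.find((j) => j.id === jobId);
    if (job) {
      job.running = false;
      job.lastStatus = status;
    }
  });
}

// Reads still take the lock, but now only queue behind short state updates.
async function list(state: ToyCronState): Promise<ToyJob[]> {
  return locked(state, async () => state.jobs.map((j) => ({ ...j })));
}

async function demoUnblockedRead() {
  const state: ToyCronState = {
    op: Promise.resolve(),
    jobs: [{ id: "heartbeat", running: false }],
  };
  const run = executeJobUnlocked(state, "heartbeat", () => sleep(200));
  const startedAt = Date.now();
  const snapshot = await list(state);
  const waitedMs = Date.now() - startedAt;
  await run;
  return {
    waitedMs,
    sawRunning: snapshot.some((j) => j.running),
    finalStatus: state.jobs.find((j) => j.id === "heartbeat")?.lastStatus,
  };
}

demoUnblockedRead().then((result) => console.log(result));
```

The trade-off in this sketch is that state can change while the turn is in flight (a job may be edited or removed), so the final phase has to re-read the job rather than reuse a reference captured before the turn started.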

Environment

  • OpenClaw version: 2026.2.4
  • 28 active cron jobs
  • Heartbeat schedule: */30 10-20 * * 1-5 (work hours) + 0 0-9,21-23 * * * (off-hours)
  • Typical agent turn duration: 60-120s
