Fix #10904: Add hard timeout to lane tasks to prevent cron wedging#11522

Open
divol89 wants to merge 15 commits into openclaw:main from divol89:fix/10904-cron-lane-timeout

@divol89 divol89 commented Feb 7, 2026

Problem

The cron scheduler lane wedges when a task hangs indefinitely. The state.active counter never decrements, blocking all subsequent jobs.

Root Cause

Lane tasks execute without any timeout. If a cron job (e.g., isolated agent turn) gets stuck waiting for model response, exec completion, or network I/O, the lane remains "active" forever.

Fix

Add a 5-minute hard timeout via Promise.race to ensure wedged tasks fail with an error instead of blocking the lane forever.

Changes

  • Added TASK_TIMEOUT_MS = 300_000 constant (5 minutes)
  • Wrapped entry.task() in Promise.race with timeout
  • Tasks that exceed the timeout throw and decrement state.active
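
The changes above can be sketched roughly as follows. This is an illustration under assumptions, not the code in src/process/command-queue.ts: `LaneState`, `LaneEntry`, `runLaneTask`, and the `timeoutMs` parameter are hypothetical names; only `TASK_TIMEOUT_MS`, `entry.task()`, `state.active`, and `Promise.race` come from the description.

```typescript
// Hypothetical sketch of a lane task guarded by a hard timeout.
const TASK_TIMEOUT_MS = 300_000; // 5 minutes, as stated in the PR description

interface LaneState {
  active: number;
}

interface LaneEntry<T> {
  task: () => Promise<T>;
}

async function runLaneTask<T>(
  state: LaneState,
  entry: LaneEntry<T>,
  timeoutMs: number = TASK_TIMEOUT_MS, // parameterized only to make the sketch easy to try
): Promise<T> {
  state.active += 1;
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`lane task timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
  });
  try {
    // Whichever promise settles first wins; a hung task loses to the timeout.
    return await Promise.race([entry.task(), timeout]);
  } finally {
    clearTimeout(timer); // a no-op if the timer already fired
    state.active -= 1; // always release the lane slot
  }
}
```

Note that `Promise.race` only stops waiting for the task. It does not cancel the underlying work, so a hung task may keep running in the background after the lane slot is released.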

Fixes #10904

Wallet: BYCgQQpJT1odaunfvk6gtm5hVd7Xu93vYwbumFfqgHb3

Greptile Overview

Greptile Summary

This PR makes cron scheduling and related subsystems more robust by (1) adding a hard timeout around lane task execution to prevent the cron lane from wedging permanently, and (2) tightening/expanding a few configuration and delivery behaviors (cron delivery fields, optional provider baseUrl defaults, per-agent heartbeat model resolution, and some UI markdown performance limits). It also adjusts cron store/timer loading so the timer tick uses persisted nextRunAtMs for determining due jobs, then recomputes next runs after executing due jobs, and includes small fixes in Signal/Telegram/TTS/gateway plumbing.

Overall direction is sound, but there are a couple of correctness issues that can affect runtime behavior (timer leak in the new lane timeout wrapper; and edit message deduplication producing "undefined" IDs).

Confidence Score: 3/5

  • This PR is close to safe to merge but has a couple of concrete runtime issues to address first.
  • Most changes are straightforward and align with the stated goal, but the new lane timeout wrapper introduces an uncleared setTimeout per task (resource leak) and the Signal edit deduplication can emit a literal "undefined" messageId, which can break downstream dedupe. Fixing these should materially reduce risk.
  • src/process/command-queue.ts, src/signal/monitor/event-handler.ts

Shadow added 14 commits February 5, 2026 15:52
When configuring Ollama via CLI (e.g., 'openclaw config set models.providers.ollama.apiKey'),
the validation was failing because baseUrl was required.

Changes:
- Make baseUrl optional in ModelProviderSchema
- Apply default baseUrl 'http://localhost:11434' for Ollama in applyModelDefaults

Fixes openclaw#9652
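
A minimal sketch of the defaulting described in this commit, assuming a simplified provider shape. `ModelProviderConfig` and the function signature are hypothetical; the `applyModelDefaults` name, the optional `baseUrl`, and the default URL come from the commit message.

```typescript
// Hypothetical, simplified provider shape; the real schema has more fields.
interface ModelProviderConfig {
  baseUrl?: string; // optional after this change
  apiKey?: string;
}

const OLLAMA_DEFAULT_BASE_URL = "http://localhost:11434";

function applyModelDefaults(
  providers: Record<string, ModelProviderConfig>,
): Record<string, ModelProviderConfig> {
  const result: Record<string, ModelProviderConfig> = { ...providers };
  const ollama = result["ollama"];
  // Fill in the default only when the user did not set a baseUrl.
  if (ollama && !ollama.baseUrl) {
    result["ollama"] = { ...ollama, baseUrl: OLLAMA_DEFAULT_BASE_URL };
  }
  return result;
}
```
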
When users send atMs as a numeric string (e.g., '1234567890') via the
cron tool, the normalization was failing to parse it correctly because
parseAbsoluteTimeMs expects ISO date strings.

This caused schedule.at to be undefined, which made computeJobNextRunAtMs
return undefined, leaving jobs without state.nextRunAtMs set. Jobs would
never execute because the scheduler couldn't determine when they were due.

Changes:
- Add parseNumericStringToMs helper to convert numeric strings to timestamps
- Use it as fallback in coerceSchedule when parseAbsoluteTimeMs fails

Fixes openclaw#9668
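
The fallback could look something like the sketch below. `parseIsoToMs` is a hypothetical stand-in for `parseAbsoluteTimeMs`, whose real behavior is not shown in this PR, and `coerceAtMs` is an illustrative name for the relevant part of `coerceSchedule`. The numeric string is treated as milliseconds, which is an assumption based on the `atMs` field name.

```typescript
// Convert a digits-only string such as "1234567890" to a millisecond timestamp.
function parseNumericStringToMs(value: string): number | undefined {
  const trimmed = value.trim();
  if (!/^\d+$/.test(trimmed)) return undefined;
  const ms = Number(trimmed);
  return Number.isFinite(ms) ? ms : undefined;
}

// Hypothetical stand-in for parseAbsoluteTimeMs: accepts ISO-style dates only.
function parseIsoToMs(value: string): number | undefined {
  if (!/^\d{4}-\d{2}-\d{2}/.test(value.trim())) return undefined;
  const ms = Date.parse(value.trim());
  return Number.isNaN(ms) ? undefined : ms;
}

// Try the ISO parser first, then fall back to the numeric-string helper.
function coerceAtMs(raw: string): number | undefined {
  return parseIsoToMs(raw) ?? parseNumericStringToMs(raw);
}
```
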
When the timer fires slightly after the scheduled time (even 1ms late),
the previous order of operations caused jobs to be skipped:

1. ensureLoaded called recomputeNextRuns, which advanced nextRunAtMs to
the NEXT occurrence (e.g., 14:00 instead of 12:00)
2. runDueJobs then checked if jobs were due, but nextRunAtMs was already
in the future, so no jobs ran

The fix reorders operations in onTimer:
1. Load store WITHOUT recomputing (preserve stored nextRunAtMs)
2. Check and run due jobs using stored nextRunAtMs values
3. THEN recompute next runs for subsequent executions
4. Persist and arm timer

This ensures jobs are checked against their original scheduled times
before any recomputation happens.

Changes:
- store.ts: Add skipRecompute option to ensureLoaded
- timer.ts: Reorder operations, call recomputeNextRuns after runDueJobs

Fixes openclaw#9661
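
The reordered tick can be illustrated with an in-memory sketch. The `CronJob` shape, the `everyMs` interval schedule, and `computeNextRunAtMs` are simplifying assumptions; only `onTimer`, `nextRunAtMs`, and the order of the steps come from the commit message.

```typescript
// Hypothetical, simplified job shape with a fixed-interval schedule.
interface CronJob {
  id: string;
  everyMs: number;
  state: { nextRunAtMs?: number };
}

// Next multiple of everyMs strictly after nowMs.
function computeNextRunAtMs(job: CronJob, nowMs: number): number {
  return (Math.floor(nowMs / job.everyMs) + 1) * job.everyMs;
}

function onTimer(jobs: CronJob[], nowMs: number, run: (job: CronJob) => void): void {
  // Steps 1-2: decide what is due from the stored nextRunAtMs, untouched.
  const due = jobs.filter(
    (job) => job.state.nextRunAtMs !== undefined && job.state.nextRunAtMs <= nowMs,
  );
  for (const job of due) run(job);
  // Step 3: only now advance every job to its next occurrence.
  for (const job of jobs) job.state.nextRunAtMs = computeNextRunAtMs(job, nowMs);
  // Step 4 (persisting the store and re-arming the timer) is omitted here.
}
```

If the recompute ran first, a tick arriving 1 ms late would move `nextRunAtMs` into the future before the due check, which is the skipped-job behavior this commit describes.
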
When agents create cron reminders, the results were not being delivered
to users because there was no way to specify the delivery channel.

Changes:
- Add deliver, channel, and to parameters to CronToolSchema
- In the 'add' action, build delivery config when these are provided
- Only apply delivery for isolated agentTurn jobs (as per constraints)

This allows agents to create reminders that deliver results back to the
originating channel by setting channel=<channel-id> and optionally to=<user>.

Fixes openclaw#9683
When a Signal message is edited, signal-cli provides an editMessage envelope
containing targetSentTimestamp (original message) and new dataMessage content.

Previously, edited messages were treated as entirely new messages, creating
duplicate context and potentially triggering duplicate responses.

Changes:
- Detect editMessage envelopes by checking for targetSentTimestamp
- Add [edited] marker to edited message text for visibility
- Use targetSentTimestamp as messageId to help with deduplication

This allows users to see when messages are edited and helps prevent
duplicate processing of the same logical message.

Fixes openclaw#9656
When opening Tool Output in the Chat view with large content (>10KB),
the browser would freeze for 10+ seconds and CPU usage spiked to 100%.

Root cause: marked.parse() is synchronous and can be very slow with large
inputs or certain patterns, even with the previous 40KB limit.

Changes:
- Lower MARKDOWN_PARSE_LIMIT from 40KB to 20KB
- Add MARKDOWN_PRE_WRAP_LIMIT at 10KB (new fast path)
- For content >10KB: skip markdown parsing entirely, render as pre-wrap
- Add white-space: pre-wrap and word-break for readable large outputs

This ensures tool outputs display immediately without blocking the UI,
while still supporting markdown formatting for smaller outputs.

Fixes openclaw#9700
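
A rough sketch of the fast path, assuming the limit is compared against the string length in characters (the commit message only says "10KB"). `RenderPlan` and `planToolOutputRender` are hypothetical names. How `MARKDOWN_PARSE_LIMIT` interacts with the new fast path is not described above, so it is left out.

```typescript
// "10KB" from the commit message, measured here in characters (assumption).
const MARKDOWN_PRE_WRAP_LIMIT = 10_000;

type RenderPlan =
  | { mode: "markdown"; text: string } // small enough to hand to the markdown parser
  | { mode: "pre-wrap"; text: string }; // render as plain text with white-space: pre-wrap

function planToolOutputRender(text: string): RenderPlan {
  // Large outputs skip the synchronous markdown parse entirely.
  if (text.length > MARKDOWN_PRE_WRAP_LIMIT) return { mode: "pre-wrap", text };
  return { mode: "markdown", text };
}
```
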
openclaw cron list was crashing with "TypeError: Cannot read properties
of undefined (reading 'trim')" when displaying jobs with schedule type 'at'
that had an undefined or missing 'at' field.

The formatIsoMinute function expected a string but was receiving undefined
when the schedule.at field was not set.

Changes:
- Update formatIsoMinute to accept string | undefined
- Return '-' early if iso is undefined/empty
- Prevents crash when displaying malformed cron jobs

Fixes openclaw#9649
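
A sketch of the guarded formatter. The output format for valid input ("YYYY-MM-DD HH:MMZ") and the handling of unparseable strings are assumptions; the commit message only specifies the `string | undefined` parameter and the early '-' return.

```typescript
function formatIsoMinute(iso: string | undefined): string {
  // Guard first: undefined or empty input must not reach trim().
  if (!iso) return "-";
  const trimmed = iso.trim();
  if (!trimmed) return "-";
  const date = new Date(trimmed);
  if (Number.isNaN(date.getTime())) return trimmed; // show unparseable input as-is
  // "2026-02-07T12:34:56.000Z" -> "2026-02-07 12:34Z"
  return date.toISOString().slice(0, 16).replace("T", " ") + "Z";
}
```
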
The heartbeat.model override feature was only checking agents.defaults.heartbeat.model
and ignoring per-agent heartbeat configuration in agents.list[].heartbeat.model.

Changes:
- Import resolveAgentConfig to get per-agent configuration
- Check specific agent's heartbeat.model first, then fall back to defaults
- This allows per-agent heartbeat model overrides to work correctly

Fixes openclaw#9556
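
The lookup order can be sketched as below. The config paths `agents.defaults.heartbeat.model` and `agents.list[].heartbeat.model` come from the commit message; the `id` field, the type names, and `resolveHeartbeatModel` are hypothetical, and the real code goes through `resolveAgentConfig` rather than searching the list directly.

```typescript
interface HeartbeatConfig {
  model?: string;
}

interface AgentEntry {
  id: string; // assumed identifier field
  heartbeat?: HeartbeatConfig;
}

interface ConfigLike {
  agents: {
    defaults?: { heartbeat?: HeartbeatConfig };
    list?: AgentEntry[];
  };
}

function resolveHeartbeatModel(cfg: ConfigLike, agentId: string): string | undefined {
  const agent = cfg.agents.list?.find((entry) => entry.id === agentId);
  // The per-agent override wins; otherwise fall back to the shared default.
  return agent?.heartbeat?.model ?? cfg.agents.defaults?.heartbeat?.model;
}
```
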
…ode proxy

When using browser commands through a node proxy (browser.proxy command),
the profile parameter was being lost because the server was looking for it
in query.profile instead of params.profile.

Changes:
- Add profile field to BrowserRequestParams type
- Read profile from typed.profile instead of query.profile

This ensures that when profile="my-browser" is specified, it is correctly
passed through the node proxy to the browser service.

Fixes openclaw#9723
When a channel posts to a group, msg.from.id returns a fake system ID
that makes all channels appear as the same sender. The correct source
is msg.sender_chat.id for channel messages.

Changes:
- Check msg.sender_chat.id first (for channel posts)
- Fall back to msg.from.id (for user messages)
- This correctly distinguishes between different channels

Fixes openclaw#9719
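
In sketch form, the lookup order described here reduces to a single expression. `TelegramMessageLike` is a trimmed-down stand-in for the Telegram message type and `resolveSenderId` is a hypothetical helper name; `sender_chat` and `from` are the field names used in the commit message.

```typescript
// Only the two fields this fix cares about.
interface TelegramMessageLike {
  from?: { id: number };
  sender_chat?: { id: number };
}

function resolveSenderId(msg: TelegramMessageLike): number | undefined {
  // Channel posts carry sender_chat; ordinary user messages only have from.
  return msg.sender_chat?.id ?? msg.from?.id;
}
```
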
Adds support for custom baseUrl in OpenAI TTS configuration, enabling
usage of OpenAI-compatible local TTS servers (Chatterbox, Coqui, LocalAI, etc.)

Changes:
- Add baseUrl field to OpenAI TTS config type (types.tts.ts)
- Add baseUrl to Zod schema (zod-schema.core.ts)
- Resolve baseUrl in TTS config (tts.ts)
- Pass baseUrl to openaiTTS function
- Use config baseUrl if provided, fall back to env/default

Example usage:
{
  messages: {
    tts: {
      openai: {
        baseUrl: "http://localhost:8004",
        model: "tts-1",
        voice: "alloy"
      }
    }
  }
}

Fixes openclaw#9709
…eout

When QMD times out, FallbackMemoryManager sets primaryFailed=true and
never retries, even after gateway restart. This is because the manager
instance is cached in QMD_MANAGER_CACHE with the failed state.

Changes:
- Call onClose() when primary fails to clear the cache
- This allows fresh retry on next memory_search call after restart

Fixes openclaw#9705
…vent duplicates

When the gateway restarted multiple times with commands.nativeSkills set to
"auto", Telegram commands were appended instead of replaced. This caused
skills to appear with duplicated suffixes (_2, _3, etc.) in the command menu.

The fix calls deleteMyCommands before setMyCommands to ensure a clean slate.

Fixes openclaw#10875

Wallet: BYCgQQpJTJT1odaunfvk6gtm5hVd7Xu93vYwbumFfqgHb3
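
The clean-slate approach from the Telegram commit above might be sketched like this. `CommandApi` is a hypothetical minimal interface standing in for whatever bot client the project uses, and `syncCommands` is an illustrative name; `deleteMyCommands` and `setMyCommands` are the method names given in the commit message.

```typescript
interface BotCommand {
  command: string;
  description: string;
}

// Hypothetical minimal client surface; a real bot client exposes more.
interface CommandApi {
  deleteMyCommands(): Promise<unknown>;
  setMyCommands(commands: BotCommand[]): Promise<unknown>;
}

async function syncCommands(api: CommandApi, commands: BotCommand[]): Promise<void> {
  // Clear whatever is registered first, then register the current set.
  await api.deleteMyCommands();
  await api.setMyCommands(commands);
}
```
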
…dging

The cron lane was wedging when a task hung indefinitely, leaving
state.active stuck at 1 and blocking all subsequent jobs.

This adds a 5-minute hard timeout via Promise.race to ensure wedged
tasks fail with an error instead of blocking the lane forever.

Fixes openclaw#10904

Wallet: BYCgQQpJT1odaunfvk6gtm5hVd7Xu93vYwbumFfqgHb3
@openclaw-barnacle openclaw-barnacle bot added channel: signal Channel integration: signal channel: telegram Channel integration: telegram app: web-ui App: web-ui gateway Gateway runtime cli CLI command changes agents Agent runtime and tooling labels Feb 7, 2026

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 2 comments

greptile-apps bot commented Feb 7, 2026

Additional Comments (2)

src/process/command-queue.ts
Timeout leaks timer

timeoutPromise creates a setTimeout that is never cleared when entry.task() resolves/rejects before the timeout. Over time, frequent lane tasks will accumulate pending timers and can keep the event loop busy unnecessarily. Store the timer handle and clearTimeout() it in a finally around the Promise.race (or use an AbortController-style timeout utility that cancels the timer).
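
One possible shape for the AbortController-style utility mentioned in this comment is sketched below. `withAbortTimeout` is a hypothetical helper, not something from this PR or the codebase. It clears the timer in `finally` and also hands the task an `AbortSignal` so cooperative tasks can stop their own work.

```typescript
async function withAbortTimeout<T>(
  task: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  // Rejects once the controller aborts, i.e. when the timer fires.
  const aborted = new Promise<never>((_, reject) => {
    controller.signal.addEventListener(
      "abort",
      () => reject(new Error(`task timed out after ${timeoutMs}ms`)),
      { once: true },
    );
  });
  try {
    return await Promise.race([task(controller.signal), aborted]);
  } finally {
    clearTimeout(timer); // no pending timer survives a task that settles early
  }
}
```

A task that ignores the signal still loses the race on timeout, but it keeps running in the background; only tasks that observe `signal.aborted` (or pass the signal to an API that accepts one) actually stop.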

src/signal/monitor/event-handler.ts
Edit messageId may be "undefined"

When isEdit is true, messageId is always set to String(editTargetTimestamp), but editTargetTimestamp can be undefined (if targetSentTimestamp is missing or non-numeric). That yields a literal "undefined" messageId, which can break deduplication logic downstream. Consider guarding this (e.g., only use targetSentTimestamp when it’s a finite number, otherwise fall back to envelope.timestamp).
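
A hedged sketch of the guard suggested here. `SignalEnvelopeLike` is a simplified, assumed envelope shape and `resolveMessageId` is a hypothetical helper; only `targetSentTimestamp`, `envelope.timestamp`, and the finite-number check come from the comment.

```typescript
// Assumed, simplified envelope shape for illustration only.
interface SignalEnvelopeLike {
  timestamp?: number;
  editMessage?: { targetSentTimestamp?: unknown };
}

function finiteNumberToId(value: unknown): string | undefined {
  return typeof value === "number" && Number.isFinite(value) ? String(value) : undefined;
}

function resolveMessageId(envelope: SignalEnvelopeLike): string | undefined {
  // Prefer the edit target when it is a usable number; otherwise fall back to
  // the envelope timestamp so String(undefined) is never emitted.
  return (
    finiteNumberToId(envelope.editMessage?.targetSentTimestamp) ??
    finiteNumberToId(envelope.timestamp)
  );
}
```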

- Clear timeout timer in command-queue to prevent timer leaks
- Guard against 'undefined' string messageId in signal event handler
@SudarshanSuryaprakash

Same issue. Is this fix going to be released?

divol89 commented Feb 8, 2026

As soon as possible.

Development

Successfully merging this pull request may close these issues.

[Bug] Cron scheduler lane wedges (jobs stop running for hours) while gateway remains responsive; restart clears
