Fix #10904: Add hard timeout to lane tasks to prevent cron wedging #11522
divol89 wants to merge 15 commits into openclaw:main
Conversation
When configuring Ollama via CLI (e.g., 'openclaw config set models.providers.ollama.apiKey'), the validation was failing because baseUrl was required.
Changes:
- Make baseUrl optional in ModelProviderSchema
- Apply default baseUrl 'http://localhost:11434' for Ollama in applyModelDefaults
Fixes openclaw#9652
When users send atMs as a numeric string (e.g., '1234567890') via the cron tool, the normalization was failing to parse it correctly because parseAbsoluteTimeMs expects ISO date strings. This caused schedule.at to be undefined, which made computeJobNextRunAtMs return undefined, leaving jobs without state.nextRunAtMs set. Jobs would never execute because the scheduler couldn't determine when they were due.
Changes:
- Add parseNumericStringToMs helper to convert numeric strings to timestamps
- Use it as fallback in coerceSchedule when parseAbsoluteTimeMs fails
Fixes openclaw#9668
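The fallback described in this commit can be sketched roughly as follows. Only the name `parseNumericStringToMs` comes from the commit message; its signature, the digits-only rule, the treatment of the value as milliseconds, and the `parseIsoToMs` and `coerceAtMs` helpers (a strict stand-in for `parseAbsoluteTimeMs` and for the relevant part of `coerceSchedule`) are assumptions made for illustration.

```typescript
// Hypothetical sketch; the real helper's signature and rules may differ.
function parseNumericStringToMs(value: string): number | undefined {
  const trimmed = value.trim();
  // Assumption: accept only plain non-negative integers, read as milliseconds.
  if (!/^\d+$/.test(trimmed)) return undefined;
  const ms = Number(trimmed);
  return Number.isSafeInteger(ms) ? ms : undefined;
}

// Stand-in for the project's ISO parser (parseAbsoluteTimeMs), which is not shown here.
// It only accepts strings that start with a YYYY-MM-DD date.
function parseIsoToMs(value: string): number | undefined {
  if (!/^\d{4}-\d{2}-\d{2}/.test(value.trim())) return undefined;
  const ms = Date.parse(value.trim());
  return Number.isNaN(ms) ? undefined : ms;
}

// Try the ISO parser first, then fall back to the numeric-string helper.
function coerceAtMs(value: string): number | undefined {
  return parseIsoToMs(value) ?? parseNumericStringToMs(value);
}
```

With this shape, `coerceAtMs("1234567890")` yields a number instead of `undefined`, so a next-run time can be computed for the job.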
When the timer fires slightly after the scheduled time (even 1ms late), the previous order of operations caused jobs to be skipped:
1. ensureLoaded called recomputeNextRuns, which advanced nextRunAtMs to the NEXT occurrence (e.g., 14:00 instead of 12:00)
2. runDueJobs then checked if jobs were due, but nextRunAtMs was already in the future, so no jobs ran
The fix reorders operations in onTimer:
1. Load store WITHOUT recomputing (preserve stored nextRunAtMs)
2. Check and run due jobs using stored nextRunAtMs values
3. THEN recompute next runs for subsequent executions
4. Persist and arm timer
This ensures jobs are checked against their original scheduled times before any recomputation happens.
Changes:
- store.ts: Add skipRecompute option to ensureLoaded
- timer.ts: Reorder operations, call recomputeNextRuns after runDueJobs
Fixes openclaw#9661
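The reordering can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the code in store.ts or timer.ts: the `Job` shape, the fixed-interval `everyMs` schedule, and the `recomputeNextRun` helper are hypothetical, and persistence and timer arming are left out.

```typescript
// Hypothetical job shape; the real store has more fields.
interface Job {
  id: string;
  everyMs: number;
  nextRunAtMs: number;
}

// Assumption: a fixed interval stands in for the real schedule computation.
function recomputeNextRun(job: Job, nowMs: number): void {
  while (job.nextRunAtMs <= nowMs) job.nextRunAtMs += job.everyMs;
}

function onTimer(jobs: Job[], nowMs: number, run: (job: Job) => void): void {
  // Steps 1 and 2: decide what is due from the stored nextRunAtMs values,
  // before anything advances them.
  const due = jobs.filter((job) => job.nextRunAtMs <= nowMs);
  for (const job of due) run(job);
  // Step 3: only now advance every job to its next occurrence.
  for (const job of jobs) recomputeNextRun(job, nowMs);
}
```

If the recompute loop ran before the due check, a timer firing 1ms late would see `nextRunAtMs` already moved into the future and skip the job, which is the failure the commit describes.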
When agents create cron reminders, the results were not being delivered to users because there was no way to specify the delivery channel.
Changes:
- Add deliver, channel, and to parameters to CronToolSchema
- In the 'add' action, build delivery config when these are provided
- Only apply delivery for isolated agentTurn jobs (as per constraints)
This allows agents to create reminders that deliver results back to the originating channel by setting channel=<channel-id> and optionally to=<user>.
Fixes openclaw#9683
When a Signal message is edited, signal-cli provides an editMessage envelope containing targetSentTimestamp (original message) and new dataMessage content. Previously, edited messages were treated as entirely new messages, creating duplicate context and potentially triggering duplicate responses.
Changes:
- Detect editMessage envelopes by checking for targetSentTimestamp
- Add [edited] marker to edited message text for visibility
- Use targetSentTimestamp as messageId to help with deduplication
This allows users to see when messages are edited and helps prevent duplicate processing of the same logical message.
Fixes openclaw#9656
When opening Tool Output in the Chat view with large content (>10KB), the browser would freeze for 10+ seconds and CPU usage spiked to 100%.
Root cause: marked.parse() is synchronous and can be very slow with large inputs or certain patterns, even with the previous 40KB limit.
Changes:
- Lower MARKDOWN_PARSE_LIMIT from 40KB to 20KB
- Add MARKDOWN_PRE_WRAP_LIMIT at 10KB (new fast path)
- For content >10KB: skip markdown parsing entirely, render as pre-wrap
- Add white-space: pre-wrap and word-break for readable large outputs
This ensures tool outputs display immediately without blocking the UI, while still supporting markdown formatting for smaller outputs.
Fixes openclaw#9700
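The fast path amounts to a size check before any markdown parsing. A minimal sketch, assuming the limit is measured in characters and that 10KB means 10,000; the `chooseRenderMode` function and `RenderMode` type are hypothetical, and the interaction with `MARKDOWN_PARSE_LIMIT` is not modeled because the commit message does not spell it out.

```typescript
// Assumption: the limit counts characters; the commit only says "10KB".
const MARKDOWN_PRE_WRAP_LIMIT = 10_000;

type RenderMode = "markdown" | "pre-wrap";

// Hypothetical helper: decide how to render before touching the markdown parser.
function chooseRenderMode(text: string): RenderMode {
  return text.length > MARKDOWN_PRE_WRAP_LIMIT ? "pre-wrap" : "markdown";
}
```

Because the check is a single length comparison, large tool outputs never reach the synchronous parser at all.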
openclaw cron list was crashing with 'TypeError: Cannot read properties of undefined (reading trim)' when displaying jobs with schedule type 'at' that had undefined or missing 'at' field. The formatIsoMinute function expected a string but was receiving undefined when the schedule.at field was not set.
Changes:
- Update formatIsoMinute to accept string | undefined
- Return '-' early if iso is undefined/empty
- Prevents crash when displaying malformed cron jobs
Fixes openclaw#9649
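A guard of this kind might look like the sketch below. The function name and the `'-'` placeholder come from the commit message; the output format (`YYYY-MM-DD HH:mmZ`) and the handling of unparseable strings are assumptions, since the real formatting is not shown on this page.

```typescript
function formatIsoMinute(iso: string | undefined): string {
  // Early return covers undefined, empty, and whitespace-only input.
  if (!iso || iso.trim() === "") return "-";
  const date = new Date(iso.trim());
  // Assumption: unparseable input is also shown as "-".
  if (Number.isNaN(date.getTime())) return "-";
  // toISOString() yields "YYYY-MM-DDTHH:mm:ss.sssZ"; keep it through the minutes.
  return `${date.toISOString().slice(0, 16).replace("T", " ")}Z`;
}
```

The important part is the first line of the body: `trim()` is only called once the value is known to be a string.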
The heartbeat.model override feature was only checking agents.defaults.heartbeat.model and ignoring per-agent heartbeat configuration in agents.list[].heartbeat.model.
Changes:
- Import resolveAgentConfig to get per-agent configuration
- Check specific agent's heartbeat.model first, then fall back to defaults
- This allows per-agent heartbeat model overrides to work correctly
Fixes openclaw#9556
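The lookup order (agent first, then defaults) can be sketched as below. The config paths mirror the ones named in the commit message, but the interfaces and the `resolveHeartbeatModel` function are hypothetical; the actual change goes through `resolveAgentConfig`, which is replaced here by a plain `find` on the list.

```typescript
interface HeartbeatConfig {
  model?: string;
}

// Hypothetical, trimmed-down view of the agents config.
interface AgentsConfig {
  defaults?: { heartbeat?: HeartbeatConfig };
  list?: Array<{ id: string; heartbeat?: HeartbeatConfig }>;
}

function resolveHeartbeatModel(agents: AgentsConfig, agentId: string): string | undefined {
  // Stand-in for resolveAgentConfig: look the agent up by id.
  const agent = agents.list?.find((entry) => entry.id === agentId);
  // The per-agent override wins; otherwise fall back to the shared default.
  return agent?.heartbeat?.model ?? agents.defaults?.heartbeat?.model;
}
```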
…ode proxy
When using browser commands through a node proxy (browser.proxy command), the profile parameter was being lost because the server was looking for it in query.profile instead of params.profile.
Changes:
- Add profile field to BrowserRequestParams type
- Read profile from typed.profile instead of query.profile
This ensures that when profile="my-browser" is specified, it is correctly passed through the node proxy to the browser service.
Fixes openclaw#9723
When a channel posts to a group, msg.from.id returns a fake system ID that makes all channels appear as the same sender. The correct source is msg.sender_chat.id for channel messages.
Changes:
- Check msg.sender_chat.id first (for channel posts)
- Fall back to msg.from.id (for user messages)
- This correctly distinguishes between different channels
Fixes openclaw#9719
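The precedence described here reduces to one expression. A minimal sketch: `from` and `sender_chat` are fields of the Telegram Bot API `Message` object, while the trimmed `TelegramMessageLike` interface and the `resolveSenderId` name are hypothetical.

```typescript
// Only the two fields relevant to sender resolution are modeled.
interface TelegramMessageLike {
  from?: { id: number };
  sender_chat?: { id: number };
}

function resolveSenderId(msg: TelegramMessageLike): number | undefined {
  // Channel posts carry the real origin in sender_chat; user messages only have from.
  return msg.sender_chat?.id ?? msg.from?.id;
}
```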
Adds support for custom baseUrl in OpenAI TTS configuration, enabling
usage of OpenAI-compatible local TTS servers (Chatterbox, Coqui, LocalAI, etc.)
Changes:
- Add baseUrl field to OpenAI TTS config type (types.tts.ts)
- Add baseUrl to Zod schema (zod-schema.core.ts)
- Resolve baseUrl in TTS config (tts.ts)
- Pass baseUrl to openaiTTS function
- Use config baseUrl if provided, fall back to env/default
Example usage:
{
  messages: {
    tts: {
      openai: {
        baseUrl: "http://localhost:8004",
        model: "tts-1",
        voice: "alloy"
      }
    }
  }
}
Fixes openclaw#9709
…eout
When QMD times out, FallbackMemoryManager sets primaryFailed=true and never retries, even after gateway restart. This is because the manager instance is cached in QMD_MANAGER_CACHE with the failed state.
Changes:
- Call onClose() when primary fails to clear the cache
- This allows fresh retry on next memory_search call after restart
Fixes openclaw#9705
…vent duplicates
When the gateway restarted multiple times with commands.nativeSkills set to "auto", Telegram commands were appended instead of replaced. This caused skills to appear with duplicated suffixes (_2, _3, etc.) in the command menu. The fix calls deleteMyCommands before setMyCommands to ensure a clean slate.
Fixes openclaw#10875
Wallet: BYCgQQpJTJT1odaunfvk6gtm5hVd7Xu93vYwbumFfqgHb3
…dging
The cron lane was wedging when a task hung indefinitely, leaving state.active stuck at 1 and blocking all subsequent jobs. This adds a 5-minute hard timeout via Promise.race to ensure wedged tasks fail with an error instead of blocking the lane forever.
Fixes openclaw#10904
Wallet: BYCgQQpJT1odaunfvk6gtm5hVd7Xu93vYwbumFfqgHb3
Additional Comments (2)
This is a comment left during a code review.
Path: src/process/command-queue.ts
Line: 60:72
Comment:
**Timeout leaks timer**
`timeoutPromise` creates a `setTimeout` that is never cleared when `entry.task()` resolves/rejects before the timeout. Over time, frequent lane tasks will accumulate pending timers and can keep the event loop busy unnecessarily. Store the timer handle and `clearTimeout()` it in a `finally` around the `Promise.race` (or use an `AbortController`-style timeout utility that cancels the timer).
How can I resolve this? If you propose a fix, please make it concise.
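One way to apply the suggestion above is to keep the timer handle and clear it in a `finally` around the race. This is a sketch, not the code in src/process/command-queue.ts: `TASK_TIMEOUT_MS` matches the constant named in the PR description, but the `runWithTimeout` helper and its error message are hypothetical.

```typescript
const TASK_TIMEOUT_MS = 300_000; // 5 minutes, as stated in the PR description

// Hypothetical wrapper: rejects if the task takes longer than timeoutMs.
async function runWithTimeout<T>(
  task: () => Promise<T>,
  timeoutMs: number = TASK_TIMEOUT_MS,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeoutPromise = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Lane task timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
  });
  try {
    return await Promise.race([task(), timeoutPromise]);
  } finally {
    // Runs whether the task or the timeout won, so no timer stays pending.
    clearTimeout(timer);
  }
}
```

Note that `Promise.race` only stops waiting on the task; it does not cancel it. A hung task keeps whatever resources it holds, but the lane can decrement its active counter and move on.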
This is a comment left during a code review.
Path: src/signal/monitor/event-handler.ts
Line: 566:571
Comment:
**Edit messageId may be "undefined"**
When `isEdit` is true, `messageId` is always set to `String(editTargetTimestamp)`, but `editTargetTimestamp` can be `undefined` (if `targetSentTimestamp` is missing or non-numeric). That yields a literal `"undefined"` messageId, which can break deduplication logic downstream. Consider guarding this (e.g., only use `targetSentTimestamp` when it’s a finite number, otherwise fall back to `envelope.timestamp`).
How can I resolve this? If you propose a fix, please make it concise.
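The guard suggested in this comment could take the following shape. It is a sketch only: `targetSentTimestamp` and the envelope timestamp fallback come from the review comment, while the `resolveMessageId` function and its parameter types are hypothetical and do not reflect the actual code in src/signal/monitor/event-handler.ts.

```typescript
// Hypothetical helper: pick the id used for deduplication.
function resolveMessageId(envelopeTimestamp: number, targetSentTimestamp: unknown): string {
  const hasEditTarget =
    typeof targetSentTimestamp === "number" && Number.isFinite(targetSentTimestamp);
  // Use the edit target only when it is a real number; otherwise fall back to
  // the envelope timestamp so the id can never be the string "undefined".
  return String(hasEditTarget ? targetSentTimestamp : envelopeTimestamp);
}
```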
- Clear timeout timer in command-queue to prevent timer leaks
- Guard against 'undefined' string messageId in signal event handler
Same issue. Is this fix going to be released?
As soon as possible
bfc1ccb to f92900f
Problem
The cron scheduler lane wedges when a task hangs indefinitely. The `state.active` counter never decrements, blocking all subsequent jobs.
Root Cause
Lane tasks execute without any timeout. If a cron job (e.g., isolated agent turn) gets stuck waiting for model response, exec completion, or network I/O, the lane remains "active" forever.
Fix
Add a 5-minute hard timeout via `Promise.race` to ensure wedged tasks fail with an error instead of blocking the lane forever.
Changes
- `TASK_TIMEOUT_MS = 300_000` constant (5 minutes)
- `entry.task()` in `Promise.race` with timeout
- `state.active`
Fixes #10904
Wallet: BYCgQQpJT1odaunfvk6gtm5hVd7Xu93vYwbumFfqgHb3
Greptile Overview
Greptile Summary
This PR makes cron scheduling and related subsystems more robust by (1) adding a hard timeout around lane task execution to prevent the cron lane from wedging permanently, and (2) tightening/expanding a few configuration and delivery behaviors (cron delivery fields, optional provider baseUrl defaults, per-agent heartbeat model resolution, and some UI markdown performance limits). It also adjusts cron store/timer loading so the timer tick uses persisted `nextRunAtMs` for determining due jobs, then recomputes next runs after executing due jobs, and includes small fixes in Signal/Telegram/TTS/gateway plumbing.
Overall direction is sound, but there are a couple of correctness issues that can affect runtime behavior (timer leak in the new lane timeout wrapper; and edit message deduplication producing `"undefined"` IDs).
Confidence Score: 3/5
`setTimeout` per task (resource leak) and the Signal edit deduplication can emit a literal "undefined" messageId, which can break downstream dedupe. Fixing these should materially reduce risk.