Subagent timeout status race condition: workers show 'timed out' when successfully completed #53106
Bug Description
Subagent workers can show status: timed out in completion events even when they completed successfully with valid results. This is a race condition in the runtime outcome tracking.
Evidence
When running a subagent test suite with 5 parallel workers, workers 4 and 5 showed:
{
"status": "timed out",
"runtime": "1s",
"tokens": { "in": 0, "out": 0 }
}
But the actual result content showed successful completion:
"Worker 4 executed successfully ✅"
"Worker 5 executed successfully ✅"
Session keys confirmed, full execution details present
Key indicator: the token count is 0 even though result content is present, which strongly suggests the stats and status are out of sync with the result.
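The key indicator above can be turned into a quick check over collected completion events. This is only a sketch: `isSuspectTimeout` is a hypothetical helper, the `result` and `worker` field names are assumptions, and only `status` and `tokens` mirror the event shown above.

```javascript
// Hypothetical helper: flags completion events whose stats contradict their result.
// `result` and `worker` are assumed field names; `status` and `tokens` mirror the event above.
function isSuspectTimeout(event) {
  const tokens = event.tokens ?? { in: 0, out: 0 };
  const hasResult = typeof event.result === 'string' && event.result.trim().length > 0;
  const zeroTokens = tokens.in === 0 && tokens.out === 0;
  return event.status === 'timed out' && zeroTokens && hasResult;
}

// Sample events. The token counts for worker 1 are made-up illustration values.
const events = [
  { worker: 4, status: 'timed out', runtime: '1s', tokens: { in: 0, out: 0 }, result: 'Worker 4 executed successfully ✅' },
  { worker: 1, status: 'completed successfully', tokens: { in: 120, out: 45 }, result: 'Worker 1 executed successfully ✅' },
];

const suspects = events.filter(isSuspectTimeout).map((e) => e.worker);
console.log(suspects);
```

A check like this does not fix the race, but it could help separate genuine timeouts from mislabeled successes while the bug is open.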
Root Cause Analysis
Per the documentation:
"Status is not inferred from model output; it comes from runtime outcome signals."
The status is determined by the runtime, not by parsing output. The race condition occurs between:
- Worker completing successfully
- Timeout timer firing
- Status being recorded
Timeline:
T25: Worker completes task successfully
T26: Success result is being written to transcript
T30: Timeout timer fires ← RACE CONDITION
T31: Status set to "timeout" (overwrites or races with success)
T32: Completion event sent with status="timeout" but result="success"
The timeout mechanism uses a separate timer/async task. When it fires, it aborts the run and sets status to "timeout". If the worker completed just before the timeout, the success result is already in the transcript, but the status field gets set incorrectly.
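The suspected interleaving can be imitated in isolation. The sketch below is not OpenClaw code; `run`, `simulateRace`, and the delays are hypothetical and chosen to roughly follow the timeline above. It assumes the timer writes the status unconditionally and that the worker only marks success after an asynchronous transcript write.

```javascript
// Minimal illustration of the suspected race. All names here are hypothetical.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function simulateRace() {
  const run = { status: 'running', result: null };

  // Timeout path: a separate timer that sets the status without checking it.
  const timer = setTimeout(() => {
    run.status = 'timeout';
  }, 30);

  // Worker path: finishes at ~25 ms and stores its result...
  await sleep(25);
  run.result = 'Worker executed successfully';

  // ...then yields while the transcript is written. The timer fires in this gap.
  await sleep(25);

  // By now the status is no longer 'running', so success is never recorded.
  if (run.status === 'running') {
    run.status = 'success';
  }
  clearTimeout(timer);
  return { status: run.status, result: run.result };
}

const outcome = simulateRace();
outcome.then((event) => console.log(event));
```

Under these assumptions the logged event carries a timeout status next to a successful result, which matches the inconsistency reported above.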
Steps to Reproduce
- Spawn multiple subagents in parallel (5+ workers)
- Some workers will complete very quickly (< 5 seconds)
- Some completion events will show status: "timed out" with 0 tokens
- But the actual result content shows successful execution
Not easily reproducible on demand — it's a race condition that depends on timing.
Expected Behavior
- Status should be "completed successfully" when the worker finishes before timeout
- Token counts should reflect actual usage
- Stats and status should be consistent with result content
Actual Behavior
- Status shows "timed out" even when result content shows success
- Token counts show 0 despite having result content
- Stats and status are inconsistent with result
Impact
| Area | Impact |
|---|---|
| Functionality | Low — Results are still delivered correctly |
| Observability | High — Status reporting is unreliable |
| Monitoring | High — Can't trust timeout alerts |
| Debugging | Medium — False positives make debugging harder |
Suggested Fixes
Option 1: Atomic Status Update
// Before setting timeout status, check if already completed
if (run.status === 'running') {
run.status = 'timeout';
}
Option 2: Completion Timestamp Check
// If completion happened before timeout, use success status
if (run.completedAt && run.completedAt < timeoutFiredAt) {
status = 'success';
} else {
status = 'timeout';
}
Option 3: Lock-Based Synchronization
// Use a lock/mutex when updating status
await run.lock.acquire();
try {
if (run.status === 'running') {
run.status = 'timeout';
}
} finally {
run.lock.release();
}
Environment
- OpenClaw version: 2026.3.22
- Node.js: v22.22.0
- Platform: Linux (WSL2)
- Model provider: Alibaba Model Studio (via Anthropic-compatible endpoint)
Related
- Investigation document: docs/research/subagent-deep-root-cause-analysis.md (in workspace)
- Test results: docs/testing/subagent-bug-investigation.md (in workspace)
Additional Context
This was discovered during a comprehensive subagent test suite. The race condition appears to be intermittent — it affected workers 4 and 5 in a batch of 5 parallel workers, while workers 1-3 showed correct "completed successfully" status.
The issue is in the runtime outcome tracking mechanism, not in model output parsing. The result content is correctly captured, but the status field is set incorrectly due to the race.
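As a possible shape for Option 1, the status write could go through a single guard where the first terminal state wins. This is a sketch under assumptions: `settle`, `TERMINAL`, and `settledAt` are hypothetical names, not part of the actual runtime.

```javascript
// Hypothetical sketch: the first terminal state wins, later writers are ignored.
const TERMINAL = new Set(['success', 'timeout', 'error']);

function settle(run, status, at = Date.now()) {
  if (TERMINAL.has(run.status)) {
    return false; // already settled; a late timer or worker must not overwrite it
  }
  run.status = status;
  run.settledAt = at;
  return true;
}

const run = { status: 'running', settledAt: null };
const workerWon = settle(run, 'success'); // worker finishes first
const timerWon = settle(run, 'timeout');  // late timeout is ignored
console.log(run.status, workerWon, timerWon);
```

Because Node.js runs callbacks on a single thread, the check and the write inside `settle` cannot be interleaved as long as there is no `await` between them, so a lock as in Option 3 may only be needed if the status update itself is asynchronous. For the guard to help, the worker path would also need to call it as soon as the task completes, before any asynchronous transcript write.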