Subagent timeout status race condition: workers show 'timed out' when successfully completed

## Bug Description

Subagent workers can show `status: timed out` in completion events even when they completed successfully with valid results. This is a race condition in the runtime outcome tracking.

## Evidence

When running a subagent test suite with 5 parallel workers, workers 4 and 5 showed:

```json
{
  "status": "timed out",
  "runtime": "1s",
  "tokens": { "in": 0, "out": 0 }
}
```

But the actual result content showed successful completion:

```
"Worker 4 executed successfully ✅"
"Worker 5 executed successfully ✅"
Session keys confirmed, full execution details present
```

**Key indicator:** Token counts are 0 even though result content is present, which indicates that the stats and status are out of sync with the result.
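The indicator above can be turned into a simple consistency check. The following is a minimal sketch, assuming a completion event shaped like the JSON in this report plus a hypothetical `result` field for the result content; `CompletionEvent` and `looksLikeFalseTimeout` are illustrative names, not part of the OpenClaw API.

```typescript
// Hypothetical event shape, modeled on the JSON shown in this report.
// The `result` field is an assumption about where result content lives.
interface CompletionEvent {
  status: string;
  runtime: string;
  tokens: { in: number; out: number };
  result?: string;
}

// Returns true when an event reports a timeout with zero token usage
// but still carries non-empty result content.
function looksLikeFalseTimeout(event: CompletionEvent): boolean {
  const zeroTokens = event.tokens.in === 0 && event.tokens.out === 0;
  const hasResult = (event.result ?? "").trim().length > 0;
  return event.status === "timed out" && zeroTokens && hasResult;
}

// Sample built from the evidence in this report.
const sampleEvent: CompletionEvent = {
  status: "timed out",
  runtime: "1s",
  tokens: { in: 0, out: 0 },
  result: "Worker 4 executed successfully",
};
const sampleIsSuspicious = looksLikeFalseTimeout(sampleEvent);
```

A check like this only flags candidates for inspection. It cannot distinguish a mislabeled success from a run that produced partial output before a genuine timeout.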

## Root Cause Analysis

Per the documentation:
> "Status is not inferred from model output; it comes from runtime outcome signals."

The status is determined by the runtime, not by parsing output. The race condition occurs between:

1. Worker completing successfully
2. Timeout timer firing
3. Status being recorded

```
Timeline (illustrative ordering, not measured timestamps):
T25: Worker completes task successfully
T26: Success result is being written to transcript
T30: Timeout timer fires ← RACE CONDITION
T31: Status set to "timeout" (overwrites or races with success)
T32: Completion event sent with status="timeout" but result="success"
```

The timeout mechanism appears to use a separate timer or async task. When it fires, it aborts the run and sets the status to "timeout" without checking whether the run has already finished. If the worker completed just before the timer fired, the success result is already in the transcript, but the status field is then overwritten with the timeout value.
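The ordering described above can be replayed deterministically. This is a minimal sketch of the suspected failure mode, not the actual runtime code; `Run`, `onWorkerComplete`, and `onTimeoutUnguarded` are hypothetical names introduced for illustration.

```typescript
// Hypothetical model of a subagent run; none of these names come from OpenClaw.
type RunStatus = "running" | "success" | "timeout";

interface Run {
  status: RunStatus;
  result: string | null;
}

// Completion path: records the result and marks the run successful.
function onWorkerComplete(run: Run, result: string): void {
  run.result = result;
  run.status = "success";
}

// Unguarded timeout path: writes the status without checking whether
// the run has already reached a terminal state.
function onTimeoutUnguarded(run: Run): void {
  run.status = "timeout";
}

// Replays the suspected ordering: completion first, then a late timer callback.
function replayRace(): Run {
  const run: Run = { status: "running", result: null };
  onWorkerComplete(run, "Worker 4 executed successfully");
  onTimeoutUnguarded(run);
  return run;
}

// After the replay, the status says "timeout" while the result still
// holds the success text, matching the mismatch seen in the evidence.
const raced = replayRace();
```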

## Steps to Reproduce

1. Spawn multiple subagents in parallel (5 or more workers).
2. Give the workers tasks that complete very quickly (under 5 seconds).
3. Inspect the completion events: some show `status: "timed out"` with 0 tokens.
4. Compare against the result content for those workers, which shows successful execution.

This is not easily reproducible on demand, because the race depends on timing.

## Expected Behavior

- Status should be "completed successfully" when the worker finishes before timeout
- Token counts should reflect actual usage
- Stats and status should be consistent with result content

## Actual Behavior

- Status shows "timed out" even when result content shows success
- Token counts show 0 despite having result content
- Stats and status are inconsistent with result

## Impact

| Area | Impact |
|------|--------|
| Functionality | Low — Results are still delivered correctly |
| Observability | High — Status reporting is unreliable |
| Monitoring | High — Can't trust timeout alerts |
| Debugging | Medium — False positives make debugging harder |

## Suggested Fixes

**Option 1: Atomic Status Update**
```typescript
// Before setting timeout status, check if already completed
if (run.status === 'running') {
  run.status = 'timeout';
}
```

**Option 2: Completion Timestamp Check**
```typescript
// If completion happened before the timeout fired, report success
let status: 'success' | 'timeout';
if (run.completedAt && run.completedAt < timeoutFiredAt) {
  status = 'success';
} else {
  status = 'timeout';
}
```

**Option 3: Lock-Based Synchronization**
```typescript
// Use a lock/mutex when updating status
await run.lock.acquire();
try {
  if (run.status === 'running') {
    run.status = 'timeout';
  }
} finally {
  run.lock.release();
}
```
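The options above share one idea: only the first terminal outcome should be recorded. Below is a minimal sketch of that idea as a single "first writer wins" transition that both the completion path and the timeout path would call. All names (`TrackedRun`, `settle`, `cancelTimer`) are hypothetical, and the sketch assumes the check-and-set contains no `await`. In Node.js, synchronous code runs to completion on the event loop, so under that assumption the two callbacks cannot interleave inside `settle`.

```typescript
type Outcome = "success" | "timeout";

// Hypothetical run record; `cancelTimer` stands in for clearing the timeout timer.
interface TrackedRun {
  status: "running" | Outcome;
  settledBy: string | null;
  cancelTimer: () => void;
}

// Records the outcome only if the run is still running.
// Returns true if this call settled the run, false if it was already settled.
function settle(run: TrackedRun, outcome: Outcome, source: string): boolean {
  if (run.status !== "running") {
    return false; // a terminal status is never overwritten
  }
  run.status = outcome;
  run.settledBy = source;
  run.cancelTimer(); // harmless if the timer has already fired
  return true;
}

// Replays "worker finishes, then a late timer fires" against the guarded path.
function demoSettle() {
  const timerState = { cancelled: false };
  const run: TrackedRun = {
    status: "running",
    settledBy: null,
    cancelTimer: () => { timerState.cancelled = true; },
  };
  const workerWon = settle(run, "success", "worker");
  const timerWon = settle(run, "timeout", "timer");
  return { run, workerWon, timerWon, timerCancelled: timerState.cancelled };
}

const demo = demoSettle();
```

With this shape, the late timer call returns `false` and the recorded status stays `"success"`. If the real status update spans an `await`, a lock as in Option 3 would still be needed.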

## Environment

- OpenClaw version: 2026.3.22
- Node.js: v22.22.0
- Platform: Linux (WSL2)
- Model provider: Alibaba Model Studio (via Anthropic-compatible endpoint)

## Related

- Investigation document: `docs/research/subagent-deep-root-cause-analysis.md` (in workspace)
- Test results: `docs/testing/subagent-bug-investigation.md` (in workspace)

## Additional Context

This was discovered during a comprehensive subagent test suite. The race condition appears to be intermittent — it affected workers 4 and 5 in a batch of 5 parallel workers, while workers 1-3 showed correct "completed successfully" status.

The issue is in the runtime outcome tracking mechanism, not in model output parsing. The result content is correctly captured, but the status field is set incorrectly due to the race.

Issue number: #53106
