Description
Summary
Gateway announce queue timeout (hardcoded at 60s) causes message memory loss during sub-agent announcement delivery. Messages successfully deliver to Telegram but are never written to the agent's session transcript, creating a split-brain where users see messages the agent has no memory of sending.
Environment
- OpenClaw Version: 2026.2.13
- Platform: macOS (local mode)
- Agent Configuration: 3 sub-agents (p-research, p-media, p-social)
Root Cause
When sub-agent announcements take longer than 60 seconds to process:
- ✅ Messages successfully deliver to Telegram (partial success)
- ❌ Transcript write-back fails (data loss)
- Result: External delivery succeeds, internal history recording fails
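The suspected failure mode can be sketched as follows. This is a hypothetical reconstruction, not OpenClaw's actual internals: `announce`, `withTimeout`, and both callbacks are illustrative names. The key assumption is that delivery and the transcript write share a single 60s deadline, so a slow transcript write is abandoned even though delivery already succeeded.

```typescript
// Illustrative sketch of the suspected split-brain: one deadline covers
// both external delivery and the transcript write.
type AnnounceResult = { delivered: boolean; transcribed: boolean };

function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    p,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error(`gateway timeout after ${ms}ms`)), ms),
    ),
  ]);
}

async function announce(
  deliver: () => Promise<void>,          // e.g. send to Telegram
  writeTranscript: () => Promise<void>,  // e.g. append to the session .jsonl
  timeoutMs = 60_000,
): Promise<AnnounceResult> {
  const result: AnnounceResult = { delivered: false, transcribed: false };
  try {
    // Both steps run under a single deadline: if writeTranscript is slow,
    // the whole chain times out after delivery has already happened.
    await withTimeout(
      (async () => {
        await deliver();
        result.delivered = true;
        await writeTranscript();
        result.transcribed = true;
      })(),
      timeoutMs,
    );
  } catch {
    // On timeout, delivered can be true while transcribed is false —
    // exactly the partial success observed in the logs.
  }
  return result;
}
```

Under this model, any transcript write slower than the shared deadline yields `delivered: true, transcribed: false`, matching the observed behavior.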
Evidence
Gateway Error Log (~/.openclaw/logs/gateway.err.log):
2026-02-15T01:30:08.587Z announce queue drain failed for agent:main:telegram:group:-1003810792546:topic:1: Error: gateway timeout after 60000ms
Gateway Delivery Log (~/.openclaw/logs/gateway.log):
2026-02-15T10:30:13.055+09:00 Good news — the sub-agent system is working now...
2026-02-15T10:30:19.871+09:00 Great — p-media just finished testing...
2026-02-15T10:30:27.550+09:00 Excellent — p-research just finished...
Session Transcript (~/.openclaw/agents/main/sessions/*.jsonl):
- All 5 messages from 10:30 AM: MISSING
- User's replies quoting those messages: PRESENT
- Agent's later acknowledgment of the issue: PRESENT
Steps to Reproduce
- Configure 3 sub-agents with announcement delivery to main agent
- Spawn all 3 simultaneously:
sessions_spawn({ task: "...", agentId: "p-research" });
sessions_spawn({ task: "...", agentId: "p-media" });
sessions_spawn({ task: "...", agentId: "p-social" });
- Wait for all to complete (~10-15 seconds)
- Check gateway.err.log for "announce queue drain failed" errors
- Verify the session transcript is missing the announcement messages
- Verify messages were delivered to Telegram/channel
Impact
Severity: CRITICAL
- Agent loses conversation continuity
- Creates contradictions (the agent's statements conflict with what it actually sent)
- Cannot reference previous delegated work
- May repeat already-completed tasks
- Breaks user trust (agent genuinely doesn't know what it told user)
Example scenario:
- Agent sends: "Research complete: Microsoft Maia 200, DeepSeek-OCR 2, NVIDIA Jetson T4000"
- User asks: "What were those AI products you found?"
- Agent replies: "I don't have any record of finding AI products"
- User is confused/frustrated because they received the original message
Configuration Attempted
Tried adding configurable timeout via gateway.config.patch:
{"gateway": {"announceTimeoutMs": 120000}}
Result: Rejected with "invalid config"
Conclusion: Gateway timeout is hardcoded in source, not user-configurable.
Recommended Fixes
Option 1: Increase Hardcoded Timeout (Quick Fix)
Change gateway announce timeout from 60s → 120s or higher in source code.
Pros: Simple, immediate relief
Cons: Still may fail with complex scenarios, doesn't address root issue
Option 2: Separate Delivery from Transcript Write (Recommended)
Decouple Telegram/channel delivery from session history write:
- If delivery succeeds but transcript fails → retry transcript write independently
- Don't let transcript write failure prevent successful delivery
- Add transcript write retry logic with exponential backoff
Pros: Proper fix, handles partial failures gracefully
Cons: Requires architectural change
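The retry half of Option 2 might look like the following sketch. The function name, attempt count, and base delay are assumptions; the point is that the transcript write gets its own retry loop with exponential backoff instead of sharing the delivery deadline.

```typescript
// Sketch of Option 2's retry logic: the transcript write retries
// independently after delivery has been committed. Illustrative only.
const sleep = (ms: number) => new Promise<void>((res) => setTimeout(res, ms));

async function writeTranscriptWithRetry(
  write: () => Promise<void>,  // the actual transcript append
  maxAttempts = 5,
  baseDelayMs = 500,
): Promise<boolean> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      await write();
      return true;  // transcript persisted
    } catch {
      // Back off: 500ms, 1s, 2s, 4s, ... before the next attempt.
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
  // Surface the failure for later reconciliation, but delivery stands.
  return false;
}
```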
Option 3: Make Timeout Configurable (Best Long-term)
Add gateway.announceTimeoutMs to config schema:
{
"gateway": {
"announceTimeoutMs": 120000
}
}
Pros: Users can tune based on workload; no source code changes needed for adjustments
Cons: Doesn't fix root cause, just delays it
Option 4: Async Transcript Write
- Don't block announcement delivery on transcript write completion
- Queue transcript writes separately with independent retry logic
- Process transcript queue asynchronously
Pros: Eliminates blocking, improves reliability
Cons: Most complex implementation
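A minimal version of the queue described in Option 4 could look like this. The class and its behavior are hypothetical: `enqueue` returns immediately so the announcement path never blocks on transcript I/O, and a background drain loop processes writes in order.

```typescript
// Sketch of Option 4: an in-process queue that drains transcript writes
// asynchronously. Illustrative only — not OpenClaw's actual design.
type Job = () => Promise<void>;

class TranscriptQueue {
  private jobs: Job[] = [];
  private draining = false;

  enqueue(job: Job): void {
    this.jobs.push(job);
    void this.drain(); // fire and forget; the caller is not blocked
  }

  private async drain(): Promise<void> {
    if (this.draining) return;
    this.draining = true;
    while (this.jobs.length > 0) {
      const job = this.jobs.shift()!;
      try {
        await job();
      } catch {
        // A real implementation would re-enqueue with backoff; omitted
        // here to keep the sketch short.
      }
    }
    this.draining = false;
  }
}
```

The trade-off is that a transcript write can now complete (or fail) well after the announcement was delivered, which is why this option pairs naturally with the retry logic from Option 2.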
Temporary Workarounds
Until fix is available:
- Serialize sub-agent spawns - don't batch multiple simultaneously
- Use sub-agents during active hours - user sees announcements in real-time
- Monitor gateway logs - watch for "announce queue drain failed" errors
- Verify delivery - ask user "Did you see the results?" after spawning
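The first workaround (serializing spawns) can be sketched as a small user-side wrapper. The `spawn` parameter stands in for the `sessions_spawn` call shown in the repro steps; the wrapper itself and the settle delay are hypothetical.

```typescript
// Workaround sketch: spawn sub-agents one at a time instead of batching,
// pausing between spawns so the announce queue can drain.
type SpawnRequest = { task: string; agentId: string };

async function spawnSerially(
  spawn: (req: SpawnRequest) => Promise<void>,  // e.g. sessions_spawn
  requests: SpawnRequest[],
  settleMs = 5_000,  // assumed settle time between spawns
): Promise<void> {
  for (const req of requests) {
    await spawn(req);
    await new Promise<void>((res) => setTimeout(res, settleMs));
  }
}
```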
Additional Context
Lane wait warnings preceded the timeout, suggesting queue congestion:
2026-02-15T01:30:01.214Z [diagnostic] lane wait exceeded:
lane=session:agent:main:telegram:group:-1003810792546:topic:1
waitedMs=52602 queueAhead=0
A 52.6-second lane wait with queueAhead=0 suggests the lane itself was stalled, rather than backed up behind other work, before the timeout occurred.
Related Files
Full bug report with detailed logs available at:
- Investigation log: memory/2026-02-15.md
- Comprehensive report: BUG_REPORT_message_memory_loss.md
Can provide upon request.