Gateway timeout causes message memory loss during sub-agent announcement delivery #16729

@Clawdette-Workspace

Summary

Gateway announce queue timeout (hardcoded at 60s) causes message memory loss during sub-agent announcement delivery. Messages successfully deliver to Telegram but are never written to the agent's session transcript, creating a split-brain where users see messages the agent has no memory of sending.

Environment

  • OpenClaw Version: 2026.2.13
  • Platform: macOS (local mode)
  • Agent Configuration: 3 sub-agents (p-research, p-media, p-social)

Root Cause

When sub-agent announcements take longer than 60 seconds to process:

  1. ✅ Messages successfully deliver to Telegram (partial success)
  2. ❌ Transcript write-back fails (data loss)
  3. Result: External delivery succeeds, internal history recording fails

Evidence

Gateway Error Log (~/.openclaw/logs/gateway.err.log):

2026-02-15T01:30:08.587Z announce queue drain failed for agent:main:telegram:group:-1003810792546:topic:1: Error: gateway timeout after 60000ms

Gateway Delivery Log (~/.openclaw/logs/gateway.log; timestamps are in +09:00, so 10:30 here corresponds to 01:30Z in the error log):

2026-02-15T10:30:13.055+09:00 Good news — the sub-agent system is working now...
2026-02-15T10:30:19.871+09:00 Great — p-media just finished testing...
2026-02-15T10:30:27.550+09:00 Excellent — p-research just finished...
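
The two logs use different offsets (the error log is in UTC, the delivery log in +09:00), so the entries have to be normalized before their order can be compared. A minimal sketch using only the standard `Date` API; the helper name `toUtcMillis` is illustrative and not part of OpenClaw:

```typescript
// Normalize an ISO 8601 timestamp (with any offset) to epoch milliseconds.
function toUtcMillis(isoTimestamp: string): number {
  const ms = Date.parse(isoTimestamp);
  if (isNaN(ms)) {
    throw new Error(`Unparseable timestamp: ${isoTimestamp}`);
  }
  return ms;
}

// Timestamps copied from the log excerpts in this report.
const drainFailedAt = toUtcMillis("2026-02-15T01:30:08.587Z");
const firstDeliveryAt = toUtcMillis("2026-02-15T10:30:13.055+09:00");

// A positive value means the first delivery was logged after the drain failure.
const gapMs = firstDeliveryAt - drainFailedAt;
```

With these values `gapMs` is 4468, so the first delivery was logged about 4.5 seconds after the drain failure was reported.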

Session Transcript (~/.openclaw/agents/main/sessions/*.jsonl):

  • All 5 messages from 10:30 AM: MISSING
  • User's replies quoting those messages: PRESENT
  • Agent's later acknowledgment of the issue: PRESENT
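
The transcript check can be scripted. The sketch below makes no assumption about the JSONL schema (which is not documented in this report): it parses each line, gathers every string value, and reports which delivered snippets appear nowhere. The function names are hypothetical.

```typescript
// Collect every string value found anywhere inside a parsed JSON value.
function collectStrings(value: unknown, out: string[] = []): string[] {
  if (typeof value === "string") {
    out.push(value);
  } else if (Array.isArray(value)) {
    for (const item of value) collectStrings(item, out);
  } else if (value !== null && typeof value === "object") {
    for (const key of Object.keys(value as object)) {
      collectStrings((value as Record<string, unknown>)[key], out);
    }
  }
  return out;
}

// Return the delivered snippets that appear nowhere in the JSONL transcript.
function findMissingFromTranscript(
  deliveredSnippets: string[],
  transcriptJsonl: string,
): string[] {
  const haystack: string[] = [];
  for (const line of transcriptJsonl.split("\n")) {
    if (line.trim() === "") continue;
    try {
      collectStrings(JSON.parse(line), haystack);
    } catch {
      // Skip lines that are not valid JSON rather than aborting the check.
    }
  }
  return deliveredSnippets.filter(
    (snippet) => !haystack.some((text) => text.indexOf(snippet) !== -1),
  );
}
```

One caveat: a user reply that quotes a missing message will make the snippet look present. Once the entry schema is known, restricting the search to the agent's own entries would avoid that false negative.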

Steps to Reproduce

  1. Configure 3 sub-agents with announcement delivery to main agent
  2. Spawn all 3 simultaneously:
    sessions_spawn({ task: "...", agentId: "p-research" });
    sessions_spawn({ task: "...", agentId: "p-media" });
    sessions_spawn({ task: "...", agentId: "p-social" });
  3. Wait for all to complete (~10-15 seconds)
  4. Check gateway.err.log for "announce queue drain failed" error
  5. Verify session transcript missing announcement messages
  6. Verify messages were delivered to Telegram/channel
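
Step 4 can be automated by scanning the error log text for the failure line. This is a sketch only: the pattern is inferred from the single log line quoted under Evidence, and `parseDrainFailure` and `findDrainFailures` are illustrative names, not OpenClaw APIs.

```typescript
interface DrainFailure {
  timestamp: string;
  lane: string;
  timeoutMs: number;
}

// Pattern inferred from the one error line quoted in this report.
const DRAIN_FAILED =
  /^(\S+) announce queue drain failed for (\S+): Error: gateway timeout after (\d+)ms$/;

// Parse one log line; returns null when the line is not a drain failure.
function parseDrainFailure(line: string): DrainFailure | null {
  const match = DRAIN_FAILED.exec(line.trim());
  if (match === null) return null;
  return { timestamp: match[1], lane: match[2], timeoutMs: Number(match[3]) };
}

// Scan the full text of gateway.err.log and return every drain failure found.
function findDrainFailures(logText: string): DrainFailure[] {
  const failures: DrainFailure[] = [];
  for (const line of logText.split("\n")) {
    const parsed = parseDrainFailure(line);
    if (parsed !== null) failures.push(parsed);
  }
  return failures;
}
```

Grouping the results by `lane` would show whether one session lane is affected or several.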

Impact

Severity: CRITICAL

  • Agent loses conversation continuity
  • Creates contradictions (the agent's later statements conflict with messages it actually sent)
  • Cannot reference previous delegated work
  • May repeat already-completed tasks
  • Breaks user trust (agent genuinely doesn't know what it told user)

Example scenario:

  • Agent sends: "Research complete: Microsoft Maia 200, DeepSeek-OCR 2, NVIDIA Jetson T4000"
  • User asks: "What were those AI products you found?"
  • Agent replies: "I don't have any record of finding AI products"
  • User is confused/frustrated because they received the original message

Configuration Attempted

Tried adding configurable timeout via gateway.config.patch:

{"gateway": {"announceTimeoutMs": 120000}}

Result: Rejected - "invalid config"

Conclusion: The gateway timeout appears to be hardcoded in source and is not user-configurable.

Recommended Fixes

Option 1: Increase Hardcoded Timeout (Quick Fix)

Change gateway announce timeout from 60s → 120s or higher in source code.

Pros: Simple, immediate relief
Cons: May still fail in complex scenarios; doesn't address the root issue

Option 2: Separate Delivery from Transcript Write (Recommended)

Decouple Telegram/channel delivery from session history write:

  • If delivery succeeds but transcript fails → retry transcript write independently
  • Don't let transcript write failure prevent successful delivery
  • Add transcript write retry logic with exponential backoff

Pros: Proper fix, handles partial failures gracefully
Cons: Requires architectural change
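
A minimal sketch of what Option 2 could look like, assuming delivery and transcript write are exposed as two separate async operations. Every name here (`announce`, `retryWithBackoff`, `RetryOptions`) is hypothetical; the gateway's real internals are not shown in this report.

```typescript
interface RetryOptions {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  // Injected so callers and tests can control waiting; defaults to a timer.
  sleep?: (ms: number) => Promise<void>;
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Delay before retry number `retryIndex` (0-based): base * 2^retryIndex, capped.
function backoffDelayMs(retryIndex: number, baseDelayMs: number, maxDelayMs: number): number {
  return Math.min(maxDelayMs, baseDelayMs * 2 ** retryIndex);
}

// Retry `operation` until it succeeds or `maxAttempts` is exhausted.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  options: RetryOptions,
): Promise<T> {
  const sleep = options.sleep ?? defaultSleep;
  let lastError: unknown;
  for (let attempt = 0; attempt < options.maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      const isLastAttempt = attempt === options.maxAttempts - 1;
      if (!isLastAttempt) {
        await sleep(backoffDelayMs(attempt, options.baseDelayMs, options.maxDelayMs));
      }
    }
  }
  throw lastError;
}

// Deliver first, then record. A transcript failure is retried on its own and
// never re-sends or blocks the delivery that already succeeded.
async function announce(
  deliver: () => Promise<void>,
  writeTranscript: () => Promise<void>,
  retry: RetryOptions,
): Promise<void> {
  await deliver();
  await retryWithBackoff(writeTranscript, retry);
}
```

If the transcript write still fails after the last attempt, `announce` rejects, so the caller can log or persist the entry instead of dropping it silently.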

Option 3: Make Timeout Configurable (Best Long-term)

Add gateway.announceTimeoutMs to config schema:

{
  "gateway": {
    "announceTimeoutMs": 120000
  }
}

Pros: Users can tune based on workload, no source code changes needed for adjustments
Cons: Doesn't fix root cause, just delays it
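
For illustration, resolving such a setting with a default and basic validation might look like the sketch below. The 60000 ms default mirrors the timeout seen in the error log; the minimum and maximum bounds and the function name are assumptions made for this example.

```typescript
interface GatewayConfig {
  gateway?: { announceTimeoutMs?: unknown };
}

const DEFAULT_ANNOUNCE_TIMEOUT_MS = 60000; // value observed in the error log
const MIN_ANNOUNCE_TIMEOUT_MS = 1000;      // hypothetical lower bound
const MAX_ANNOUNCE_TIMEOUT_MS = 600000;    // hypothetical upper bound

// Return the configured timeout, or the default when the key is absent.
// Invalid values are rejected with a message that names the key.
function resolveAnnounceTimeoutMs(config: GatewayConfig): number {
  const raw = config.gateway?.announceTimeoutMs;
  if (raw === undefined) return DEFAULT_ANNOUNCE_TIMEOUT_MS;
  if (typeof raw !== "number" || !isFinite(raw) || Math.floor(raw) !== raw) {
    throw new Error("gateway.announceTimeoutMs must be an integer number of milliseconds");
  }
  if (raw < MIN_ANNOUNCE_TIMEOUT_MS || raw > MAX_ANNOUNCE_TIMEOUT_MS) {
    throw new Error(
      `gateway.announceTimeoutMs must be between ${MIN_ANNOUNCE_TIMEOUT_MS} and ${MAX_ANNOUNCE_TIMEOUT_MS}`,
    );
  }
  return raw;
}
```

An error message that names the key and the accepted range would also be more actionable than the generic "invalid config" rejection.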

Option 4: Async Transcript Write

  • Don't block announcement delivery on transcript write completion
  • Queue transcript writes separately with independent retry logic
  • Process transcript queue asynchronously

Pros: Eliminates blocking, improves reliability
Cons: Most complex implementation
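
One possible shape for Option 4, sketched under the assumption that transcript writes go through a single async writer function. `TranscriptWriteQueue` and its types are hypothetical, and a production version would also wait between attempts rather than retrying immediately.

```typescript
interface TranscriptEntry {
  sessionKey: string;
  text: string;
}

type TranscriptWriter = (entry: TranscriptEntry) => Promise<void>;

class TranscriptWriteQueue {
  private readonly pending: { entry: TranscriptEntry; attempts: number }[] = [];
  private draining = false;
  private readonly write: TranscriptWriter;
  private readonly maxAttempts: number;
  // Entries that exhausted their attempts; kept for inspection, not dropped.
  readonly failed: TranscriptEntry[] = [];

  constructor(write: TranscriptWriter, maxAttempts = 3) {
    this.write = write;
    this.maxAttempts = maxAttempts;
  }

  // Called right after delivery; returns immediately without awaiting a write.
  enqueue(entry: TranscriptEntry): void {
    this.pending.push({ entry, attempts: 0 });
  }

  get size(): number {
    return this.pending.length;
  }

  // Process entries in order. A failing entry stays at the head and is
  // retried there, so transcript ordering is preserved.
  async drain(): Promise<void> {
    if (this.draining) return; // another drain is already running
    this.draining = true;
    try {
      while (this.pending.length > 0) {
        const head = this.pending[0];
        try {
          await this.write(head.entry);
          this.pending.shift();
        } catch {
          head.attempts += 1;
          if (head.attempts >= this.maxAttempts) {
            this.failed.push(head.entry);
            this.pending.shift();
          }
        }
      }
    } finally {
      this.draining = false;
    }
  }
}
```

The delivery path would call `enqueue` and return, while `drain` runs separately (for example on a timer or after each enqueue), so a slow transcript write no longer counts against the announce timeout.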

Temporary Workarounds

Until a fix is available:

  1. Serialize sub-agent spawns - don't batch multiple simultaneously
  2. Use sub-agents during active hours - user sees announcements in real-time
  3. Monitor gateway logs - watch for "announce queue drain failed" errors
  4. Verify delivery - ask user "Did you see the results?" after spawning
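
Workaround 1 amounts to awaiting each spawn before starting the next. A sketch, assuming a spawn function that returns a promise which resolves once the sub-agent has finished and its announcement has been handled. Whether `sessions_spawn` behaves this way is not confirmed in this report, so the `SpawnFn` type below is a stand-in.

```typescript
interface SpawnRequest {
  task: string;
  agentId: string;
}

// Hypothetical signature standing in for the real spawn call.
type SpawnFn = (request: SpawnRequest) => Promise<unknown>;

// Run the spawns one at a time instead of launching them as a batch.
async function spawnSerially(spawn: SpawnFn, requests: SpawnRequest[]): Promise<unknown[]> {
  const results: unknown[] = [];
  for (const request of requests) {
    // Await each spawn before starting the next, so at most one is in flight.
    results.push(await spawn(request));
  }
  return results;
}
```

This trades total wall-clock time for fewer concurrent announcements competing for the same session lane.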

Additional Context

Lane wait warnings preceded the timeout, suggesting queue congestion:

2026-02-15T01:30:01.214Z [diagnostic] lane wait exceeded: 
  lane=session:agent:main:telegram:group:-1003810792546:topic:1 
  waitedMs=52602 queueAhead=0

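This indicates the lane had already been congested or blocked for about 52 seconds before the timeout occurred. With queueAhead=0, the wait may come from in-flight work holding the lane rather than from a backlog of queued items.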
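
The relationship between the lane wait and the 60-second timeout can be checked with a little arithmetic on the logged values. The sketch below parses the diagnostic quoted in this section (the pattern is inferred from that single example, and `parseLaneWait` is an illustrative name) and adds the time elapsed until the drain failure.

```typescript
interface LaneWait {
  lane: string;
  waitedMs: number;
  queueAhead: number;
}

// Parse a "lane wait exceeded" diagnostic whose fields may wrap across lines.
function parseLaneWait(text: string): LaneWait | null {
  const flat = text.replace(/\s+/g, " ");
  const match =
    /lane wait exceeded: lane=(\S+) waitedMs=(\d+) queueAhead=(\d+)/.exec(flat);
  if (match === null) return null;
  return { lane: match[1], waitedMs: Number(match[2]), queueAhead: Number(match[3]) };
}

const laneWarning = parseLaneWait(
  "2026-02-15T01:30:01.214Z [diagnostic] lane wait exceeded: \n" +
    "  lane=session:agent:main:telegram:group:-1003810792546:topic:1 \n" +
    "  waitedMs=52602 queueAhead=0",
);

// Time between the warning and the drain failure, from the two log timestamps.
const warningToFailureMs =
  Date.parse("2026-02-15T01:30:08.587Z") - Date.parse("2026-02-15T01:30:01.214Z");

// Wait already accumulated at the warning plus the time until the failure.
const totalWaitMs =
  laneWarning === null ? null : laneWarning.waitedMs + warningToFailureMs;
```

Here `warningToFailureMs` is 7373 and `totalWaitMs` is 59975, within 25 ms of the 60000 ms timeout. That is consistent with the announcement spending nearly the entire timeout window waiting on the session lane.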

Related Files

Full bug report with detailed logs available at:

  • Investigation log: memory/2026-02-15.md
  • Comprehensive report: BUG_REPORT_message_memory_loss.md

Can provide upon request.
