[Bug]: Config-change-triggered restart fails via launchctl (ETIMEDOUT), gateway enters degraded state, memory indexes stuck at zero for hours #36822
Description
Bug type
Behavior bug (incorrect output/state without crash)
Summary
When a config change triggers an automatic gateway restart on macOS (LaunchAgent), the spawnSync launchctl call can time out. The fallback in-process restart also times out ("shutdown timed out; exiting without full cleanup"). The gateway continues running in a degraded state where memory reindex commands (openclaw memory index) complete successfully but indexes don't actually recover — leaving agent memory indexes at zero for hours until the process eventually self-heals.
Steps to reproduce
- Run OpenClaw gateway via macOS LaunchAgent (ai.openclaw.gateway)
- Have multiple agents with large memory indexes (1000+ files across 5 agents)
- Make a config change that requires gateway restart (e.g., ACP settings: acp.enabled, acp.defaultAgent, acp.stream)
- Gateway detects config change, sends SIGUSR1, attempts spawnSync launchctl kickstart
- Spawn times out → falls back to in-process restart → shutdown also times out
Expected behavior
Gateway should cleanly restart. If the primary restart mechanism fails, the fallback should either:
- Successfully complete the in-process restart, OR
- Exit cleanly so LaunchAgent's KeepAlive restarts a fresh process
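To make the second option concrete, here is a minimal sketch of a bounded shutdown that exits with a non-zero code when the deadline passes. It is not OpenClaw's actual implementation: `shutdownOrExit`, `ShutdownFn`, and `ExitFn` are hypothetical names, and the exit function is injected so the behavior can be exercised without killing the caller.

```typescript
// Hypothetical names throughout; none of these are part of OpenClaw's code.
type ShutdownFn = () => Promise<void>;
type ExitFn = (code: number) => void;

/**
 * Runs `shutdown` against a deadline.
 * Resolves to "clean" if shutdown finished in time. Otherwise calls
 * `exit(1)` and resolves to "forced". A rejection from `shutdown`
 * propagates to the caller unchanged.
 */
async function shutdownOrExit(
  shutdown: ShutdownFn,
  timeoutMs: number,
  exit: ExitFn,
): Promise<"clean" | "forced"> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<"forced">((resolve) => {
    timer = setTimeout(() => resolve("forced"), timeoutMs);
  });
  try {
    const outcome = await Promise.race([
      shutdown().then(() => "clean" as const),
      deadline,
    ]);
    if (outcome === "forced") {
      // Do not keep serving from a half-torn-down process. A non-zero
      // exit hands the restart to the supervisor (KeepAlive on launchd).
      exit(1);
    }
    return outcome;
  } finally {
    // Clear the timer so a clean shutdown does not keep the event loop alive.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

In a real gateway the third argument would presumably be `process.exit`, and the timeout would be whatever shutdown budget the gateway already uses.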
Actual behavior
Gateway enters a degraded state where:
- Telegram message routing continues working (agents can chat)
- Memory indexer is stuck: the openclaw memory index CLI reports success but indexes don't update
- Agent memory indexes drop to 0/N or freeze at stale counts
- Watchdog monitoring detects the issue but repeated reindex attempts don't fix it
- Self-heals after ~12-19 hours (likely when LaunchAgent eventually restarts the process)
OpenClaw version
2026.3.2
Operating system
macOS (Darwin 25.3.0, arm64, M1 Pro)
Install method
npm global
Logs, screenshots, and evidence
2026-03-04T04:18:16.943Z [reload] config change requires gateway restart (acp)
2026-03-04T04:18:19.023Z [gateway] full process restart failed (spawnSync launchctl ETIMEDOUT); falling back to in-process restart
2026-03-04T04:18:20.010Z [reload] config change requires gateway restart (acp.defaultAgent)
2026-03-04T04:19:01.170Z [gateway] shutdown timed out; exiting without full cleanup
Timeline:
- 04:18 - Failed restart. Gateway degraded.
- 07:10 - First memory index gap detected (indexed frozen, total growing)
- 11:07 - One agent's index drops to ZERO. Stays at zero despite reindex every 30 min.
- 23:19 - Self-healed (~19 hours later)
Impact and severity
Affected: Multi-agent deployments on macOS (LaunchAgent) with large memory indexes (1000+ files across 5+ agents)
Severity: High — one agent's memory index stuck at zero for 12 hours, blocking all memory-dependent functionality
Frequency: Triggered once during embedding provider migration (config change → failed restart). Likely reproducible whenever a config change requiring restart occurs during heavy indexing.
Consequence: Agent loses all memory search capability. External watchdog reindex cannot recover while gateway is in degraded state. Self-heals only after eventual clean process restart (~12-19 hours).
Additional information
Additional Environment Information:
Gateway: LaunchAgent (ai.openclaw.gateway)
Memory provider: Ollama (nomic-embed-text, 768-dim)
Agents: 5 (main + 4 staff)
Total indexed files: ~1,300 across all agents
Related Issues:
#10600 — Desktop app triggers periodic SIGUSR1 restarts (same SIGUSR1 → restart path)
#24279 / #24301 — PID mismatch in restart health check (restart mechanism bugs)
#32048 — "shutdown timed out; exiting without full cleanup" (same error, different trigger)
#32277 — Memory indexer doesn't detect embedding dimension mismatch (related — we were mid-migration)
#11728 — runSafeReindex partial reindex after DB deletion (related reindex bug)
#3573 — openclaw status always reports memory as dirty (makes debugging harder)
Workaround:
External memory watchdog (LaunchAgent, runs every 30 min) that detects index drops and runs openclaw memory index. Works for normal cases but cannot recover from the degraded gateway state described here. Adding logic to detect repeated failures and trigger a full process restart (via launchctl kickstart -k) would mitigate this.
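The escalation described above could look roughly like the following decision function. This is a sketch under assumptions: `nextAction`, `WatchdogState`, and the threshold of three attempts are made up for illustration, and the `gui/<uid>/ai.openclaw.gateway` service target assumes the LaunchAgent is loaded in the per-user GUI domain. The function only decides what to run; actually spawning the command is left to the caller.

```typescript
// Hypothetical watchdog policy; names and threshold are illustrative.
type WatchdogAction =
  | { kind: "none" }
  | { kind: "reindex"; argv: string[] }
  | { kind: "restart"; argv: string[] };

interface WatchdogState {
  // Consecutive checks that found the index unhealthy and ran a reindex.
  consecutiveFailures: number;
}

function nextAction(
  indexHealthy: boolean,
  state: WatchdogState,
  uid: number,
  maxReindexAttempts = 3,
): { action: WatchdogAction; state: WatchdogState } {
  if (indexHealthy) {
    return { action: { kind: "none" }, state: { consecutiveFailures: 0 } };
  }
  if (state.consecutiveFailures >= maxReindexAttempts) {
    // Reindexing has not helped: force a full process restart instead.
    // Assumed service target: per-user GUI domain plus the agent label.
    const target = `gui/${uid}/ai.openclaw.gateway`;
    return {
      action: { kind: "restart", argv: ["launchctl", "kickstart", "-k", target] },
      state: { consecutiveFailures: 0 },
    };
  }
  return {
    action: { kind: "reindex", argv: ["openclaw", "memory", "index"] },
    state: { consecutiveFailures: state.consecutiveFailures + 1 },
  };
}
```

With the default threshold and a 30-minute schedule, an index that stays unhealthy gets three reindex attempts and a restart on the fourth check, which bounds the outage at about two hours rather than 12-19.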
Suggested Fix:
- The in-process restart fallback should force-exit the process (e.g., process.exit(1)) after shutdown timeout, rather than continuing in a degraded state. Let LaunchAgent/systemd handle the restart cleanly.
- Consider adding a health check that detects "indexes stuck" state and triggers self-restart.
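As a rough illustration of the second suggestion, an "indexes stuck" check can be phrased over a short history of (indexed, total) counts: the index has a backlog and the indexed count has not risen across the last few samples. This matches both symptoms in the timeline (indexed frozen while total grows, and indexed dropping to zero and staying there). `IndexSample`, `isIndexStuck`, and the three-sample window are hypothetical and not taken from OpenClaw.

```typescript
// Hypothetical health check; the sample shape and window size are assumptions.
interface IndexSample {
  indexed: number; // files currently present in the memory index
  total: number;   // files that should be indexed
}

/**
 * Returns true when the most recent `minSamples` samples show a backlog
 * (indexed < total) and no sample rose above the first one in the window.
 */
function isIndexStuck(samples: IndexSample[], minSamples = 3): boolean {
  if (samples.length < minSamples) return false; // not enough history yet
  const recent = samples.slice(-minSamples);
  const first = recent[0];
  const last = recent[recent.length - 1];
  const hasBacklog = last.indexed < last.total;
  const noProgress = recent.every((s) => s.indexed <= first.indexed);
  return hasBacklog && noProgress;
}
```

A gateway that sees this return true for long enough could then trigger the same forced exit it would use after a shutdown timeout.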