Closed
Labels: enhancement (New feature or request)
Problem
In Untether v0.34.0 production, Claude Code sessions freeze: the subprocess stays alive but stops emitting stream-json events on its stdout pipe. The stall monitor detects the stall after 5 min but records no diagnostic information, so root cause analysis from logs alone is impossible.
Observed failure modes:
- Triage: 2 parallel Agent subagents → lost ALL TCP → 81% CPU, zero TCP, no stdout for 30+ min
- Auditor-toolkit: Bash finished → Claude sleeping with 1 ESTABLISHED TCP → no stdout for 10+ min
- Triage (resumed): Same session resumed → immediately stalled again (tainted context)
Current gaps:
- Stall monitor logs only elapsed time — no process state, TCP, last action
- Subprocess watchdog only detects dead processes, not "alive but stalled"
- No auto-recovery mechanism
- No event timeline context for post-mortem analysis
- stderr captured but not accessible during stalls
Solution
Rich diagnostics on every stall for post-mortem, progressive warnings, and safe auto-recovery:
- Process diagnostics module (`proc_diag.py`): `/proc/{pid}/` reads for CPU, memory, TCP, FDs, children
- Event tracking on JsonlStreamState: timestamp, type, tool name, ring buffer of recent events (all engines)
- PID injection into StartedEvent meta: base class handles all engines automatically
- Stderr ring buffer: `stream.stderr_capture` accessible from watchdog and stall monitor
- Progressive stall monitor: repeating warnings with fresh `/proc` diagnostics each time
- Liveness watchdog: 10 min timeout for "alive but silent" with optional auto-kill (zero TCP + zero CPU)
- Session completion summary: one-line log for post-mortem pattern analysis
- Watchdog config: `[watchdog]` section with `liveness_timeout`, `stall_auto_kill`, `stall_repeat_seconds`
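
A minimal sketch of how the event ring buffer and the liveness decision could fit together. This is not Untether's actual code: `EventTracker` and `liveness_action` are hypothetical names, the buffer size of 20 is an assumption, and auto-kill is assumed to be off by default. Only the 10 min timeout, the config key names, and the "zero TCP + zero CPU" rule come from the list above.

```python
import time
from collections import deque


class EventTracker:
    """Hypothetical stand-in for the event tracking described for JsonlStreamState."""

    def __init__(self, max_events=20):  # buffer size is an assumption
        # Ring buffer: the oldest entries drop off once max_events is reached.
        self.recent = deque(maxlen=max_events)
        self.last_event_at = None

    def record(self, event_type, tool_name=None, now=None):
        """Store (timestamp, type, tool name) for one stream event."""
        ts = time.monotonic() if now is None else now
        self.last_event_at = ts
        self.recent.append((ts, event_type, tool_name))

    def silent_for(self, now=None):
        """Seconds since the last event, or 0.0 if nothing was recorded yet."""
        if self.last_event_at is None:
            return 0.0
        current = time.monotonic() if now is None else now
        return current - self.last_event_at


def liveness_action(silent_seconds, tcp_connections, cpu_percent,
                    liveness_timeout=600.0, stall_auto_kill=False):
    """Return "ok", "warn", or "kill" for an alive-but-silent subprocess.

    Parameter names mirror the [watchdog] keys from the issue; the
    stall_auto_kill default of False is an assumption.
    """
    if silent_seconds < liveness_timeout:
        return "ok"
    # Auto-kill only when enabled and the process has zero TCP and zero CPU.
    if stall_auto_kill and tcp_connections == 0 and cpu_percent == 0.0:
        return "kill"
    return "warn"
```

Under this rule, a stall like the first observed failure mode (81% CPU, zero TCP) would produce a warning, not a kill, because the process is still consuming CPU.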