
Session stall diagnostics: process alive but no stream-json events #97

@nathanschram

Description

Problem

In Untether v0.34.0 production, Claude Code sessions freeze — the subprocess stays alive but stops producing stream-json stdout events on the pipe. The stall monitor detects it after 5 min but has no diagnostic info, making root cause analysis impossible from logs alone.

Observed failure modes:

  1. Triage: 2 parallel Agent subagents → lost ALL TCP → 81% CPU, zero TCP, no stdout for 30+ min
  2. Auditor-toolkit: Bash finished → Claude sleeping with 1 ESTABLISHED TCP → no stdout for 10+ min
  3. Triage (resumed): Same session resumed → immediately stalled again (tainted context)

Current gaps:

  • Stall monitor logs only elapsed time — no process state, TCP, last action
  • Subprocess watchdog only detects dead processes, not "alive but stalled"
  • No auto-recovery mechanism
  • No event timeline context for post-mortem analysis
  • stderr captured but not accessible during stalls

Solution

Capture rich diagnostics on every stall for post-mortem analysis, emit progressive warnings, and add safe auto-recovery:

  1. Process diagnostics module (proc_diag.py) — /proc/{pid}/ reads for CPU, memory, TCP, FDs, children
  2. Event tracking on JsonlStreamState — timestamp, type, tool name, ring buffer of recent events (all engines)
  3. PID injection into StartedEvent meta — base class handles all engines automatically
  4. Stderr ring buffer (stream.stderr_capture) accessible from watchdog and stall monitor
  5. Progressive stall monitor — repeating warnings with fresh /proc diagnostics each time
  6. Liveness watchdog — 10 min timeout for "alive but silent" with optional auto-kill (zero TCP + zero CPU)
  7. Session completion summary — one-line log for post-mortem pattern analysis
  8. Watchdog config ([watchdog] section): liveness_timeout, stall_auto_kill, stall_repeat_seconds
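
As a rough illustration of items 1 and 6, here is a minimal sketch of what a /proc-based snapshot and the "zero TCP + zero CPU" auto-kill check could look like. This is not the actual proc_diag.py: `ProcSnapshot`, `parse_stat`, `count_established`, `snapshot`, and `should_auto_kill` are hypothetical names, the field choices are assumptions, and child-process enumeration is left out. It is Linux-only, and it reads "zero CPU" as "no CPU ticks consumed between two snapshots", which is also an assumption.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ProcSnapshot:
    """Hypothetical container for one diagnostic sample of a subprocess."""
    pid: int
    state: str            # one-letter state from /proc/<pid>/stat, e.g. "S" or "R"
    cpu_ticks: int        # utime + stime in clock ticks
    rss_kb: int
    fd_count: int
    tcp_established: int


def parse_stat(text: str) -> tuple[str, int]:
    """Return (state, utime + stime) from the contents of /proc/<pid>/stat."""
    # The comm field may contain spaces or parentheses, so split after the last ')'.
    rest = text[text.rindex(")") + 2:].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return rest[0], int(rest[11]) + int(rest[12])


def count_established(tcp_table: str, socket_inodes: set[str]) -> int:
    """Count ESTABLISHED rows in a /proc/net/tcp-style table owned by the given inodes."""
    count = 0
    for line in tcp_table.splitlines()[1:]:  # first line is the column header
        cols = line.split()
        # Column 3 is the hex socket state (01 = ESTABLISHED); column 9 is the inode.
        if len(cols) > 9 and cols[3] == "01" and cols[9] in socket_inodes:
            count += 1
    return count


def snapshot(pid: int) -> ProcSnapshot:
    """Read one diagnostic sample for pid from /proc (Linux only)."""
    base = Path("/proc") / str(pid)
    state, ticks = parse_stat((base / "stat").read_text())
    rss_pages = int((base / "statm").read_text().split()[1])
    rss_kb = rss_pages * os.sysconf("SC_PAGE_SIZE") // 1024

    inodes: set[str] = set()
    fd_count = 0
    for fd in (base / "fd").iterdir():
        fd_count += 1
        try:
            target = os.readlink(fd)
        except OSError:
            continue  # fd was closed between listing and readlink
        if target.startswith("socket:["):
            inodes.add(target[len("socket:["):-1])

    established = 0
    for name in ("tcp", "tcp6"):
        table = base / "net" / name
        if table.exists():
            established += count_established(table.read_text(), inodes)
    return ProcSnapshot(pid, state, ticks, rss_kb, fd_count, established)


def should_auto_kill(prev: ProcSnapshot, curr: ProcSnapshot, silent_for: float,
                     liveness_timeout: float = 600.0,
                     stall_auto_kill: bool = False) -> bool:
    """Kill only if enabled, silent past the timeout, CPU-idle, and without TCP."""
    if not stall_auto_kill or silent_for < liveness_timeout:
        return False
    cpu_idle = curr.cpu_ticks == prev.cpu_ticks
    return cpu_idle and curr.tcp_established == 0
```

Under these assumptions, the stall monitor would take one snapshot per warning (every stall_repeat_seconds), log it, and pass the previous and current snapshots to `should_auto_kill`. With this rule, the first observed failure mode (81% CPU, zero TCP) would be logged but not killed, because the process is still consuming CPU.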


Labels: enhancement
