
Session mirror: garbage not filtered + word-merging/letter drops #29

@martymcenroe

Problem

The session mirror (Tab 2, tail -f of logs/session-*.log) produces noisy, barely readable output. There are two distinct problems:

1. Garbage lines not filtered

The mirror's _log_to_mirror() has only 4 noise patterns in NOISE_RE plus a 12-char minimum length filter. Meanwhile clean_transcript.py has 95 garbage patterns that effectively remove TUI artifacts post-hoc. The mirror lets through all of the following:

| Garbage type | Example from live session |
| --- | --- |
| Thinking fragments | g (thinking), n (thinking), ✶ s n (thinking) |
| Spinner activity lines | Harmonizing… 1, ✢Harmonizing… 2, *Harmonizing… 5 |
| Spinner activity with timing | Harmonizing…(30s · ↓2.2k tokens) |
| Token/timing fragments | 1 · 6.7k tokens, 2 s · 6.7k tokens, 15.1k tokens |
| Permission UI chrome | Esctocancel·Tabtoamend·ctrl+etoexplain |
| "Do you want to proceed?" UI | Doyouwanttoproceed? |
| Permission option repaints | ❯1.Yes, 2.Yes,allowreadingfromProjects\fromthisproject |
| Agent tree lines | ├─ Explore (Read core source files) · 10 tool uses |
| Agent initializing | ⎿ Initializing… |
| "ctrl+b to run in background" | ctrl+b to run in background |
| "ctrl+o to expand" | +18 more tool uses (ctrl+o to expand) |
| Running N agents | Running 2 gents…(ctrl+o to expand) |
| Status bar timestamps | [02-1401:41:44] /c/Users/mcwiz/Projects/Hermes (main) |
| File count lines | 2 s, reading 19 files… |
| Bash command labels | Bash command |
| Bare "thought for Ns" | (thought for 2s) |

2. Word-merging and letter drops

The mirror_strip_ansi() cursor-tracking parser inserts spaces only when the cursor jumps past the current column position. This works for simple cases like:

\x1b[7;1HIt\x1b[7;4His  →  "It" + gap(3→4) + "is"  →  "It is"  ✓

But the Ink TUI renders many words character-by-character at adjacent columns with no gap:

\x1b[5;20HI\x1b[5;21H'\x1b[5;22Hl\x1b[5;23Hl\x1b[5;24Hr\x1b[5;25He\x1b[5;26Ha\x1b[5;27Hd
→ No column gaps → "I'llread" (correct per parser, wrong per English)

The parser cannot distinguish "adjacent chars in the same word" from "adjacent chars at the start of a new word" because the TUI uses the same positioning pattern for both.
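To make the limitation concrete, here is a minimal sketch of a gap-based parser. It is not the actual mirror_strip_ansi() code: the name strip_with_gaps is hypothetical, it only handles cursor-position sequences (CSI row;col H), and it ignores the row entirely. It reproduces both behaviours described above.

```python
import re

# CSI cursor-position sequence such as "\x1b[7;4H" (row;col, 1-based).
CUP_RE = re.compile(r'\x1b\[(\d+);(\d+)H')

def strip_with_gaps(data: str) -> str:
    """Hypothetical single-line sketch: drop cursor moves, and insert one
    space whenever the cursor jumps past where the previous text ended."""
    out = []
    expected_col = None  # column the next char would occupy with no jump
    pos = 0
    for m in CUP_RE.finditer(data):
        text = data[pos:m.start()]  # text written at the previous position
        if text:
            out.append(text)
            if expected_col is not None:
                expected_col += len(text)
        col = int(m.group(2))
        if expected_col is not None and col > expected_col:
            out.append(' ')  # column gap: treat it as a word boundary
        expected_col = col
        pos = m.end()
    out.append(data[pos:])  # trailing text after the last cursor move
    return ''.join(out)

# Gap between column 3 and column 4: a space is inserted.
simple = strip_with_gaps("\x1b[7;1HIt\x1b[7;4His")

# Every character lands on the next adjacent column: no gap, no space.
adjacent = "".join(f"\x1b[5;{20 + i}H{c}" for i, c in enumerate("I'llread"))
merged = strip_with_gaps(adjacent)
```

Here simple comes out as "It is" while merged stays "I'llread": nothing in the byte stream marks the boundary between "I'll" and "read".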

Examples from live session:

| Mirror output | Should be |
| --- | --- |
| I'llreadthecodebase,issues,andwikiinparallel. | I'll read the codebase, issues, and wiki in parallel. |
| Doyouwanttoproceed? | Do you want to proceed? |
| ls-la/c/Users/mcwiz/Projects/HermesWiki/ | ls -la /c/Users/mcwiz/Projects/HermesWiki/ |
| 2>/dev/null\|\|echo"Nolocalwikiclone" | 2>/dev/null \|\| echo "No local wiki clone" |
| 2 gents | 2 agents (dropped letter) |
| loal | local (dropped letter) |
| Lst | List (dropped letter) |
| one-l esummris | one-line summaries (dropped letters + wrong space) |
| un acked | untracked (dropped letters + wrong space) |
| remoebranches | remote branches (dropped letters + merged) |
| Sow recent session logs | Show recent session logs (dropped letter) |

Letter drops happen when the TUI repaints a line and the PTY read lands mid-repaint — some characters from the old render and some from the new render get interleaved, losing characters at the boundary.

Current filtering (insufficient)

import re

# _log_to_mirror() filters:
NOISE_RE = [
    re.compile(r'^\s*$'),                        # blank lines
    re.compile(r'^[\u2500\u2550]{10,}$'),        # ─ or ═ horizontal rules
    re.compile(r'^\xb7\s+\S+ing'),               # · spinner lines (middle dot only)
    re.compile(r'^\s*\d+ files? '),              # file count status
]
# Plus: 12-char minimum, separator filter, timestamp fragment filter, 32-line dedup
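
A quick check of why this leaks, using only the four patterns quoted above. The helper matches_noise and the SAMPLES list are mine, not from the mirror code. The first sample is a constructed middle-dot spinner line, and the rest are taken from the table in section 1. The 12-char minimum and the other filters are not modelled here.

```python
import re

# The four patterns from NOISE_RE, copied verbatim from the excerpt above.
NOISE_RE = [
    re.compile(r'^\s*$'),
    re.compile(r'^[\u2500\u2550]{10,}$'),
    re.compile(r'^\xb7\s+\S+ing'),
    re.compile(r'^\s*\d+ files? '),
]

def matches_noise(line: str) -> bool:
    """Hypothetical helper: True if any NOISE_RE pattern matches the line."""
    return any(p.search(line) for p in NOISE_RE)

SAMPLES = [
    "· Harmonizing…",                          # constructed: middle-dot spinner
    "✢Harmonizing… 2",                         # spinner with a different glyph
    "*Harmonizing… 5",
    "2 s, reading 19 files…",                  # digits not directly before "files"
    "Esctocancel·Tabtoamend·ctrl+etoexplain",
    "(thought for 2s)",
]

# Lines that none of the four patterns would reject.
LEAKED = [s for s in SAMPLES if not matches_noise(s)]
```

Only the middle-dot sample is caught, so everything else in SAMPLES ends up in LEAKED.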

Plan: B + D + F

B — Shared filter module (garbage filtering)

Extract GARBAGE_PATTERNS and is_garbage() from clean_transcript.py into a shared src/transcript_filters.py module. Both clean_transcript.py and unleashed-c-20.py import from it.

  • Single source of truth — add a pattern once, both scripts get it
  • No pattern drift — the mirror and the post-hoc cleaner always match
  • No performance concern: matching 95 precompiled regexes against one line of text should take microseconds, since the matching engine behind Python's re is implemented in C. Not measurable even with 100 terminals open.
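
A rough sketch of what the shared module could look like. The four patterns below are illustrative stand-ins that I wrote against the examples in the garbage table. They are not the real 95 patterns from clean_transcript.py, and the real is_garbage() may take different arguments.

```python
"""Sketch of src/transcript_filters.py (patterns are illustrative only)."""
import re

# Hypothetical subset; the real list would be moved over from clean_transcript.py.
GARBAGE_PATTERNS = [
    re.compile(r'\(thinking\)\s*$'),                           # thinking fragments
    re.compile(r'^\W*\w+ing…'),                                # spinner activity lines
    re.compile(r'^\s*(\d+\s*s?\s*·\s*)?[\d.]+k? tokens\s*$'),  # token/timing fragments
    re.compile(r'ctrl\+[a-z]\s*to\s*\w+'),                     # key-hint chrome
]

def is_garbage(line: str) -> bool:
    """True if the line matches any known TUI-artifact pattern."""
    return any(p.search(line) for p in GARBAGE_PATTERNS)

# Both scripts would then share one definition, for example:
#   from transcript_filters import is_garbage
#   if is_garbage(line):
#       continue
```

Because the patterns are compiled once at import time, each call to is_garbage() only pays for the matching itself.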

D — Accept word-merging in live mirror

The cursor-tracking parser is fundamentally limited by how the Ink TUI renders text. Word boundaries without column gaps are undetectable at the ANSI level. Accept merged words in the live mirror. Use clean_transcript.py --fix-spaces (wordninja) post-session for readable prose.

Splitting merged words in the live mirror itself is a separate problem that still needs its own issue; see "Related" below.
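
For anyone unfamiliar with what the post-session step does, here is a toy dictionary splitter. This is not wordninja and not the --fix-spaces code. It is a plain dynamic-programming stand-in with a tiny hypothetical vocabulary that picks the split using the fewest known words. As far as I know, wordninja instead scores candidates by word frequency over a much larger list.

```python
# Hypothetical mini-vocabulary, for illustration only.
VOCAB = {"do", "you", "want", "to", "proceed", "no", "local", "wiki", "clone"}
MAX_WORD = max(len(w) for w in VOCAB)
UNKNOWN_COST = 10  # per-character penalty for text not covered by VOCAB

def split_words(merged: str) -> list[str]:
    """Split a lowercase run into the fewest vocabulary words; characters
    that fit no word are emitted one by one at a high cost."""
    n = len(merged)
    # best[i] = (cost, start index of the last piece) for the prefix merged[:i]
    best = [(0, 0)] + [(float("inf"), 0)] * n
    for i in range(1, n + 1):
        for j in range(max(0, i - MAX_WORD), i):
            piece = merged[j:i]
            if piece in VOCAB:
                cost = best[j][0] + 1
            elif i - j == 1:
                cost = best[j][0] + UNKNOWN_COST
            else:
                continue
            if cost < best[i][0]:
                best[i] = (cost, j)
    # Walk the back-pointers from the end to recover the pieces.
    words, i = [], n
    while i > 0:
        j = best[i][1]
        words.append(merged[j:i])
        i = j
    return words[::-1]

proceed = split_words("doyouwanttoproceed")
```

With this vocabulary, proceed is ["do", "you", "want", "to", "proceed"]. Case and punctuation would need separate handling, which is part of why this is better suited to a post-session pass than to the live mirror.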

F — Rate-limit mirror writes

Buffer PTY output for 200-500ms before processing for the mirror. Larger chunks → fewer mid-repaint reads → fewer letter drops and less garbage.

  • The mirror lags slightly, but it is only read through tail -f, where a 200ms delay is imperceptible
  • Reduces both garbage volume AND letter drops cheaply
  • Combines naturally with B (filtered after buffering)
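
One possible shape for the buffer, as a sketch only. MirrorBuffer, sink, and clock are names I made up, and the real change would live inside _log_to_mirror(). This version checks elapsed time on each incoming chunk rather than running a real timer, so the caller still has to call flush() when the PTY goes idle and at session end.

```python
import time
from typing import Callable, List, Optional

class MirrorBuffer:
    """Accumulate PTY chunks and pass them on in batches that span
    at least `interval` seconds."""

    def __init__(self, sink: Callable[[str], None], interval: float = 0.2,
                 clock: Callable[[], float] = time.monotonic):
        self._sink = sink            # receives each joined batch
        self._interval = interval    # 0.2 s is the low end of the 200-500ms range
        self._clock = clock          # injectable so tests need not sleep
        self._chunks: List[str] = []
        self._first_at: Optional[float] = None  # arrival of oldest pending chunk

    def feed(self, data: str) -> None:
        now = self._clock()
        if self._first_at is None:
            self._first_at = now
        self._chunks.append(data)
        if now - self._first_at >= self._interval:
            self.flush()

    def flush(self) -> None:
        """Emit everything pending as one string; safe to call when empty."""
        if self._chunks:
            self._sink(''.join(self._chunks))
        self._chunks = []
        self._first_at = None
```

The sink is where the ANSI stripping and garbage filtering from B would run, so a repaint that arrives split across several reads is processed as one chunk.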

Implementation steps

  1. Create src/transcript_filters.py — extract GARBAGE_PATTERNS, COMPACTION_KEEP, is_garbage(), normalize_for_dedup() from clean_transcript.py
  2. Update clean_transcript.py to import from shared module
  3. Update unleashed-c-20.py _log_to_mirror() to use is_garbage() instead of NOISE_RE
  4. Add 200ms write buffer in _log_to_mirror() (accumulate data, flush on timer)
  5. Test: run a session and compare mirror output before/after

Related

  • src/clean_transcript.py — 95 garbage patterns, wordninja integration
  • docs/runbooks/0908-transcript-cleaning.md — post-session cleaning workflow
  • Separate issue needed for live word-splitting (wordninja or alternative in the mirror itself)
