Skip to content

fix(cli, fact-checker): reconfigure stdio to UTF-8 on Windows#1282

Merged
igorls merged 4 commits intoMemPalace:developfrom
mvalentsev:fix/fact-checker-stdio-utf8
May 6, 2026
Merged

fix(cli, fact-checker): reconfigure stdio to UTF-8 on Windows#1282
igorls merged 4 commits intoMemPalace:developfrom
mvalentsev:fix/fact-checker-stdio-utf8

Conversation

@mvalentsev
Copy link
Copy Markdown
Contributor

@mvalentsev mvalentsev commented Apr 30, 2026

What

Reconfigure stdin/stdout/stderr to UTF-8 on Windows in two entry points, with a per-stream errors policy that matches what each one writes:

  • mempalace/cli.py:main() -- the primary mempalace console_script
  • mempalace/fact_checker.py:__main__ -- python -m mempalace.fact_checker --stdin

Per-stream policy:

  • stdin -- surrogateescape: a malformed byte from a redirected file (or a misbehaving caller) becomes a lone surrogate the consumer's parser surfaces, instead of UnicodeDecodeError killing the read on the first bad byte.
  • stdout -- replace: mempalace search and the fact_checker --stdin path both print verbatim drawer / fact text. A drawer that round-tripped a filename through surrogateescape can carry a lone surrogate; strict would UnicodeEncodeError mid-print and lose the rest of the result block. replace substitutes U+FFFD instead and the result still renders.
  • stderr -- replace: same hazard for warning lines that quote user-supplied paths.

Why

On Windows, Python defaults stdio to the system ANSI codepage (cp1252/cp1251/cp950 depending on locale). That mojibakes non-ASCII content at the process boundary -- a hard bug to debug because verbatim drawer text gets corrupted in pipes, and arguments / interactive input read through input() come back garbled.

After auditing every stdio entry point on develop, three user-facing console_scripts / module invocations route non-ASCII content through sys.stdin / sys.stdout:

After this PR all three of the package's user-facing stdio entry points reconfigure identically on Windows.

mempalace/cli.py:main()

The primary CLI dispatches to subcommands that print verbatim drawer text and wing/room names (mempalace search, mempalace status, mempalace wake-up) and read non-ASCII names via input() through interactive flows (mempalace init -> onboarding -> entity / room detectors).

Concrete failure modes:

  • mempalace search "..." > out.txt -- piped stdout mojibakes drawer text containing Cyrillic / CJK
  • mempalace ... < input.txt -- piped stdin mojibakes non-ASCII content before subcommand sees it
  • Interactive flows on a non-Latin Windows console fall back to ANSI codepage when stdin is redirected (CI scripts, test harnesses)

The reconfigure cascades to every subcommand because sys.stdin / sys.stdout are the same module-global streams that cmd_init, cmd_search, cmd_status, cmd_hook, etc. inherit.

mempalace/fact_checker.py:__main__

fact_checker.py:325 calls sys.stdin.read() from the __main__ block when invoked as python -m mempalace.fact_checker --stdin. Same Windows codepage failure mode -- non-ASCII fact text comes back as mojibake before pattern parsing sees it. Low-traffic CLI utility, fixed for sweep consistency rather than in response to a user-filed bug.

How

Shared helper in mempalace/_stdio.py:

def reconfigure_stdio_utf8_on_windows(
    *,
    stdin_errors: str = "surrogateescape",
    stdout_errors: str = "strict",
    stderr_errors: str = "strict",
    on_failure: Callable[[str, BaseException], None] | None = None,
) -> None:
    ...

No-op off Windows. Each stream's reconfigure is wrapped in try/except so a replaced stream (Jupyter, test harness) routes through the on_failure callback (defaults to a WARNING: line on sys.stderr) and continues rather than crashing the entry point.

cli.py and fact_checker.py ship thin wrappers that pass the CLI policy (stdout_errors="replace", stderr_errors="replace"); the MCP-side reconfigure (#400) shares the same helper with its strict policy via the same module. The thin wrappers preserve the existing _reconfigure_stdio_utf8_on_windows() import surface so existing tests stay shape-compatible.

Tests

tests/test_cli.py:

  • test_reconfigures_stdio_to_utf8_on_windows -- patches sys.platform = "win32" plus a ReconfigurableStringIO for each stream; asserts each received the right per-stream reconfigure(encoding="utf-8", errors=...) exactly once (stdin=surrogateescape, stdout/stderr=replace).
  • test_reconfigure_stdio_is_noop_off_windows -- patches sys.platform = "linux"; asserts no reconfigure call.

tests/test_fact_checker.py::TestCLI:

  • Same two tests against fact_checker._reconfigure_stdio_utf8_on_windows.

Local run: 83 passed (cli + fact_checker suites). ruff check . and ruff format --check . clean.

Out of scope

  • Existing CLI / fact_checker detection logic is unchanged.
  • No changes to imports, public API, or behavior off Windows.
  • File I/O sites with open() / Path.read_text() lacking explicit encoding="utf-8" are a separate bug class (mojibake on file content, not stdio) and would warrant their own audit.
  • Internal utility scripts invoked as python -m mempalace.<module> for development (dialect, diary_ingest, repair, spellcheck, etc.) are not in this sweep -- they are reached through mempalace ... subcommands which now reconfigure at cli.py:main() and inherit the UTF-8 streams.

Body updated 2026-05-03 to match landed code: 03643eb switched the errors policy from blanket strict to per-stream (stdin=surrogateescape, stdout/stderr=replace) so a redirected file with bad bytes does not crash the read and a drawer carrying a surrogate half from a filename round-trip does not crash mid-print; 285b3b4 extracted the loop into mempalace/_stdio.py so the CLI / fact_checker / mcp_server entry points share one helper instead of carrying duplicate copies.

@mvalentsev mvalentsev changed the title fix(fact-checker): reconfigure stdio to UTF-8 on Windows fix(cli, fact-checker): reconfigure stdio to UTF-8 on Windows Apr 30, 2026
@mvalentsev mvalentsev marked this pull request as ready for review April 30, 2026 10:13
@igorls igorls added area/cli CLI commands area/i18n Multilingual, Unicode, non-English embeddings area/windows Windows-specific bugs and compatibility bug Something isn't working labels May 2, 2026
@igorls igorls added this to the v3.3.5 milestone May 2, 2026
mvalentsev added 3 commits May 3, 2026 21:33
The `python -m mempalace.fact_checker --stdin` entry point reads non-ASCII
text through the system ANSI codepage (cp1252/cp1251/cp950) on Windows,
which mojibakes characters before claim-extraction sees them. Reconfigure
stdin/stdout/stderr to UTF-8 with `errors="strict"`, wrapped in try/except
so a replaced stream (Jupyter, test harness) logs a warning rather than
crashing the CLI.

Mirrors the same fix shipped for `mcp_server.py:main()` (MemPalace#400) and
`hooks_cli.py:run_hook()` (MemPalace#1280) -- this is the third and last
stdin-reading entry point in the package.
The primary `mempalace` console_script (`cli.py:main()`) reads non-ASCII
arguments via piped stdin and writes verbatim drawer text / wing names
through `print()`. On Windows, Python defaults stdio to the system ANSI
codepage (cp1252/cp1251/cp950), so:

- `mempalace search "..." > out.txt` mojibakes any drawer text containing
  non-Latin characters
- `mempalace ... < input.txt` mojibakes piped non-ASCII input

Reconfigure stdin/stdout/stderr to UTF-8 (`errors="strict"`) at the top
of `main()`, mirroring the helper added in this PR for fact_checker's
`__main__` block. Wrapped in try/except so a replaced stream (Jupyter,
test harness) logs a warning and continues rather than crashing the CLI.

The reconfigure cascades through every `mempalace` subcommand
(`init`/`mine`/`search`/`status`/`hook`/etc.) and through the interactive
flows that read non-ASCII names via `input()` (onboarding, entity
detector, room detector). With this commit the package's three
user-facing entry points (`mempalace`, `mempalace-mcp`, and
`python -m mempalace.fact_checker`) all reconfigure stdio identically on
Windows.
Previously all three streams reconfigured to UTF-8 with errors='strict'.
That kills 'mempalace search' the moment a drawer carrying a surrogate
half (round-tripped from a filename via surrogateescape) hits print(),
losing the rest of the result block. Same hazard for warning lines on
stderr.

Split the policy:
  stdin  -> surrogateescape (malformed bytes from a redirected file
            survive as lone surrogates instead of crashing the read)
  stdout -> replace (drawer text with a stray surrogate becomes U+FFFD
            instead of UnicodeEncodeError mid-print)
  stderr -> replace (same protection for logger / warning paths)

Applied identically in the cli.py and fact_checker.py helpers; the DRY
extraction into a shared module is a separate cleanup ask, kept out of
this fix to keep the diff narrow.

Tests updated for the new per-stream assertion.
@mvalentsev mvalentsev force-pushed the fix/fact-checker-stdio-utf8 branch from 187af0f to 03643eb Compare May 3, 2026 16:37
Both cli.py and fact_checker.py carried identical 28-line Windows stdio
reconfigure helpers; pull the loop into mempalace/_stdio.py so the same
machine drives the CLI, the fact_checker --stdin entry point, and the
MCP server. The thin per-call-site wrappers stay so existing tests keep
importing _reconfigure_stdio_utf8_on_windows from the same module they
always have.

CLI / fact_checker policy unchanged: stdin=surrogateescape (don't crash
on a malformed redirected file), stdout/stderr=replace (don't crash
mid-print on a surrogate half round-tripped from a filename).
@igorls igorls merged commit 3824ea6 into MemPalace:develop May 6, 2026
6 checks passed
igorls added a commit that referenced this pull request May 6, 2026
…1282 #1167 #1160

Bundled CHANGELOG entries for the seven Tier-1 PRs merged today, including
the behavior-change call-out for #1167 (KG date validators now reject
non-ISO inputs that previously produced silent empty results).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/cli CLI commands area/i18n Multilingual, Unicode, non-English embeddings area/windows Windows-specific bugs and compatibility bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants