fix(cli, fact-checker): reconfigure stdio to UTF-8 on Windows #1282
Merged
igorls merged 4 commits into MemPalace:develop on May 6, 2026
This was referenced May 2, 2026
The `python -m mempalace.fact_checker --stdin` entry point reads non-ASCII text through the system ANSI codepage (cp1252/cp1251/cp950) on Windows, which mojibakes characters before claim-extraction sees them. Reconfigure stdin/stdout/stderr to UTF-8 with `errors="strict"`, wrapped in try/except so a replaced stream (Jupyter, test harness) logs a warning rather than crashing the CLI. Mirrors the same fix shipped for `mcp_server.py:main()` (MemPalace#400) and `hooks_cli.py:run_hook()` (MemPalace#1280) -- this is the third and last stdin-reading entry point in the package.
The primary `mempalace` console_script (`cli.py:main()`) reads non-ASCII arguments via piped stdin and writes verbatim drawer text / wing names through `print()`. On Windows, Python defaults stdio to the system ANSI codepage (cp1252/cp1251/cp950), so:

- `mempalace search "..." > out.txt` mojibakes any drawer text containing non-Latin characters
- `mempalace ... < input.txt` mojibakes piped non-ASCII input

Reconfigure stdin/stdout/stderr to UTF-8 (`errors="strict"`) at the top of `main()`, mirroring the helper added in this PR for fact_checker's `__main__` block. Wrapped in try/except so a replaced stream (Jupyter, test harness) logs a warning and continues rather than crashing the CLI.

The reconfigure cascades through every `mempalace` subcommand (`init`/`mine`/`search`/`status`/`hook`/etc.) and through the interactive flows that read non-ASCII names via `input()` (onboarding, entity detector, room detector).

With this commit the package's three user-facing entry points (`mempalace`, `mempalace-mcp`, and `python -m mempalace.fact_checker`) all reconfigure stdio identically on Windows.
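The stdin failure mode described above can be reproduced on any platform by decoding UTF-8 bytes with an ANSI codepage, which is what an unconfigured Windows stdin does with a redirected UTF-8 file. A minimal sketch; the sample string and the `read_as` helper are illustrative and not part of the package:

```python
import io


def read_as(raw: bytes, encoding: str) -> str:
    """Decode raw bytes the way a text-mode stdin with this encoding would."""
    stream = io.TextIOWrapper(io.BytesIO(raw), encoding=encoding)
    return stream.read()


# Bytes as they would sit in a UTF-8 encoded input file.
payload = "café".encode("utf-8")

# Read through the cp1252 codepage: the two UTF-8 bytes of "é" become two characters.
garbled = read_as(payload, "cp1252")

# Read through a stream configured for UTF-8: the text survives intact.
clean = read_as(payload, "utf-8")

print(repr(garbled))
print(repr(clean))
```

The corruption happens at decode time, before any subcommand touches the text, which is why the fix has to sit at the top of the entry point rather than in the consumers.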
Previously all three streams reconfigured to UTF-8 with errors='strict'.
That kills 'mempalace search' the moment a drawer carrying a surrogate
half (round-tripped from a filename via surrogateescape) hits print(),
losing the rest of the result block. Same hazard for warning lines on
stderr.
Split the policy:
stdin -> surrogateescape (malformed bytes from a redirected file
survive as lone surrogates instead of crashing the read)
stdout -> replace (a stray surrogate in drawer text is written as '?'
instead of raising UnicodeEncodeError mid-print)
stderr -> replace (same protection for logger / warning paths)
Applied identically in the cli.py and fact_checker.py helpers; the DRY
extraction into a shared module is a separate cleanup ask, kept out of
this fix to keep the diff narrow.
Tests updated for the new per-stream assertion.
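The effect of each error handler can be checked with in-memory streams standing in for stdin and stdout. This is a sketch of the behaviour only, not the package's code; `write_with` and the sample bytes are illustrative. Note that on the encode side CPython's `replace` handler writes `?`; U+FFFD is what the handler produces when decoding.

```python
import io

# stdin side: surrogateescape keeps an undecodable byte as a lone surrogate
# instead of raising UnicodeDecodeError.
stdin_like = io.TextIOWrapper(
    io.BytesIO(b"ok \x80 ok"), encoding="utf-8", errors="surrogateescape"
)
text = stdin_like.read()  # the 0x80 byte is now the lone surrogate U+DC80


def write_with(errors: str, s: str) -> bytes:
    """Write s through a UTF-8 text stream using the given error handler."""
    buf = io.BytesIO()
    out = io.TextIOWrapper(buf, encoding="utf-8", errors=errors)
    out.write(s)
    out.flush()
    return buf.getvalue()


# stdout side under the old blanket policy: strict cannot encode the surrogate.
try:
    write_with("strict", text)
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# stdout side under the split policy: the surrogate is substituted and
# the rest of the text is still written.
replaced = write_with("replace", text)

print(strict_failed, replaced)
```

The asymmetry is deliberate: input should preserve what it cannot decode so the consumer can decide, while output should degrade rather than abort a result block halfway through.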
187af0f to 03643eb
Both cli.py and fact_checker.py carried identical 28-line Windows stdio reconfigure helpers; pull the loop into mempalace/_stdio.py so the same machinery drives the CLI, the fact_checker --stdin entry point, and the MCP server.

The thin per-call-site wrappers stay so existing tests keep importing _reconfigure_stdio_utf8_on_windows from the same module they always have.

CLI / fact_checker policy unchanged: stdin=surrogateescape (don't crash on a malformed redirected file), stdout/stderr=replace (don't crash mid-print on a surrogate half round-tripped from a filename).
What
Reconfigure stdin/stdout/stderr to UTF-8 on Windows in two entry points, with a per-stream `errors` policy that matches what each one reads or writes:

- `mempalace/cli.py:main()` -- the primary `mempalace` console_script
- `mempalace/fact_checker.py:__main__` -- `python -m mempalace.fact_checker --stdin`

Per-stream policy:

- stdin -> `surrogateescape`: a malformed byte from a redirected file (or a misbehaving caller) becomes a lone surrogate the consumer's parser surfaces, instead of `UnicodeDecodeError` killing the read on the first bad byte.
- stdout -> `replace`: `mempalace search` and the fact_checker `--stdin` path both print verbatim drawer / fact text. A drawer that round-tripped a filename through `surrogateescape` can carry a lone surrogate; `strict` would raise `UnicodeEncodeError` mid-print and lose the rest of the result block. `replace` writes `?` in place of the surrogate and the result still renders.
- stderr -> `replace`: same hazard for warning lines that quote user-supplied paths.

Why
On Windows, Python defaults stdio to the system ANSI codepage (cp1252/cp1251/cp950 depending on locale). That mojibakes non-ASCII content at the process boundary -- a hard bug to debug because verbatim drawer text gets corrupted in pipes, and arguments / interactive input read through `input()` come back garbled.

After auditing every stdio entry point on `develop`, three user-facing console_scripts / module invocations route non-ASCII content through `sys.stdin` / `sys.stdout`:

- `mempalace/mcp_server.py:main()` -- already fixed in #400 ("fix: reconfigure MCP stdin/stdout to UTF-8 on Windows (fixes #363)")
- `mempalace/hooks_cli.py:run_hook()` -- already fixed in #1280 ("Fix/windows hook stdio utf8")
- `mempalace/cli.py:main()` and `mempalace/fact_checker.py:__main__` -- this PR

After this PR all three of the package's user-facing stdio entry points reconfigure to UTF-8 on Windows through the same helper.
`mempalace/cli.py:main()`

The primary CLI dispatches to subcommands that print verbatim drawer text and wing/room names (`mempalace search`, `mempalace status`, `mempalace wake-up`) and read non-ASCII names via `input()` through interactive flows (`mempalace init` -> onboarding -> entity / room detectors).

Concrete failure modes:

- `mempalace search "..." > out.txt` -- piped stdout mojibakes drawer text containing Cyrillic / CJK
- `mempalace ... < input.txt` -- piped stdin mojibakes non-ASCII content before the subcommand sees it

The reconfigure cascades to every subcommand because `sys.stdin` / `sys.stdout` are the same module-global streams that `cmd_init`, `cmd_search`, `cmd_status`, `cmd_hook`, etc. inherit.

`mempalace/fact_checker.py:__main__`

`fact_checker.py:325` calls `sys.stdin.read()` from the `__main__` block when invoked as `python -m mempalace.fact_checker --stdin`. Same Windows codepage failure mode -- non-ASCII fact text comes back as mojibake before pattern parsing sees it. Low-traffic CLI utility, fixed for sweep consistency rather than in response to a user-filed bug.

How
Shared helper in `mempalace/_stdio.py`. No-op off Windows. Each stream's reconfigure is wrapped in try/except so a replaced stream (Jupyter, test harness) routes through the `on_failure` callback (defaults to a `WARNING:` line on `sys.stderr`) and continues rather than crashing the entry point.

`cli.py` and `fact_checker.py` ship thin wrappers that pass the CLI policy (`stdout_errors="replace"`, `stderr_errors="replace"`); the MCP-side reconfigure (#400) uses the same helper with its strict policy. The thin wrappers preserve the existing `_reconfigure_stdio_utf8_on_windows()` import surface so existing tests stay shape-compatible.

Tests
`tests/test_cli.py`:

- `test_reconfigures_stdio_to_utf8_on_windows` -- patches `sys.platform = "win32"` plus a `ReconfigurableStringIO` for each stream; asserts each received the right per-stream `reconfigure(encoding="utf-8", errors=...)` exactly once (stdin=surrogateescape, stdout/stderr=replace).
- `test_reconfigure_stdio_is_noop_off_windows` -- patches `sys.platform = "linux"`; asserts no reconfigure call.

`tests/test_fact_checker.py::TestCLI`:

- `fact_checker._reconfigure_stdio_utf8_on_windows`.

Local run: 83 passed (cli + fact_checker suites). `ruff check .` and `ruff format --check .` clean.

Out of scope
- `fact_checker` detection logic is unchanged.
- `open()` / `Path.read_text()` calls lacking explicit `encoding="utf-8"` are a separate bug class (mojibake on file content, not stdio) and would warrant their own audit.
- Modules invoked as `python -m mempalace.<module>` for development (dialect, diary_ingest, repair, spellcheck, etc.) are not in this sweep -- they are reached through `mempalace ...` subcommands which now reconfigure at `cli.py:main()` and inherit the UTF-8 streams.

Body updated 2026-05-03 to match landed code: `03643eb` switched the `errors` policy from blanket `strict` to per-stream (stdin=surrogateescape, stdout/stderr=replace) so a redirected file with bad bytes does not crash the read and a drawer carrying a surrogate half from a filename round-trip does not crash mid-print; `285b3b4` extracted the loop into `mempalace/_stdio.py` so the CLI / fact_checker / mcp_server entry points share one helper instead of carrying duplicate copies.