Skip to content

fix: improve Windows compatibility — enforce UTF-8 encoding#162

Merged
memtomem merged 4 commits intomemtomem:mainfrom
powerzist:develop
Apr 19, 2026
Merged

fix: improve Windows compatibility — enforce UTF-8 encoding#162
memtomem merged 4 commits intomemtomem:mainfrom
powerzist:develop

Conversation

@powerzist
Copy link
Copy Markdown
Collaborator

@powerzist powerzist commented Apr 16, 2026

Problem

Although memtomem uses UTF-8 encoding, Windows uses cp949 as the default encoding, which causes encoding conflicts.
Below are examples of actual errors that occurred:

self = WindowsPath('C:/Users/user/AppData/Local/Temp/pytest-of-user/pytest-30/test_edited_file_is_reindexed_0/fake_home/.claude/projects/-Users-test-Work-demo-project/memory/feedback_a.md')
data = '# Feedback A\n\nRevised guidance: prefer bge-m3 for multilingual corpora — the Korean coverage is measurably better than bge-small, and the English quality is comparable.\n'
encoding = 'locale', errors = None, newline = None

    def write_text(self, data, encoding=None, errors=None, newline=None):
        """
        Open the file in text mode, write to it, and close the file.
        """
        if not isinstance(data, str):
            raise TypeError('data must be str, not %s' %
                            data.__class__.__name__)
        with self.open(mode='w', encoding=encoding, errors=errors, newline=newline) as f:
>           return f.write(data)
                   ^^^^^^^^^^^^^
E           UnicodeEncodeError: 'cp949' codec can't encode character '\u2014' in position 73: illegal multibyte sequence

..\..\AppData\Roaming\uv\python\cpython-3.13.5-windows-x86_64-none\Lib\pathlib\_abc.py:652: UnicodeEncodeError

This is a typical encoding conflict, and the error message clearly indicates that the issue is caused by an encoding mismatch.

self = <tests.test_e2e_pipeline.TestCrossLanguageE2E object at 0x0000021546BC6490>
components = Components(config=Mem2MemConfig(embedding=EmbeddingConfig(provider='none', model='bge-m3', dimension=1024, base_url=''... 0x0000021546BCFA50>, search_pipeline=<memtomem.search.pipeline.SearchPipeline object at 0x00000215477C2B30>, llm=None)
memory_dir = WindowsPath('C:/Users/user/AppData/Local/Temp/pytest-of-user/pytest-30/test_mixed_language_context0/memories')

    async def test_mixed_language_context(self, components, memory_dir):
        """Mixed-language doc with context expansion preserves both languages."""
        doc = memory_dir / "mixed.md"
        doc.write_text(
            "## Project Overview\n\n"
            "Building a knowledge management platform.\n\n"
            "## 기술 스택\n\n"
            "Python 3.12, FastAPI, SQLite with FTS5.\n\n"
            "## Architecture\n\n"
            "Monorepo with uv workspace. Two packages.\n"
        )
        await components.index_engine.index_file(doc)

        results, _ = await components.search_pipeline.search("기술 스택", top_k=1, context_window=1)
>       assert len(results) >= 1
E       assert 0 >= 1
E        +  where 0 = len([])

packages\memtomem\tests\test_e2e_pipeline.py:185: AssertionError

In this case, the immediate cause of the error is a search failure, but the underlying cause of that failure is an encoding mismatch. Therefore, this can also ultimately be considered an encoding conflict issue.

To address the various issues caused by this and reduce the likelihood of similar problems in the future, we discussed enforcing UTF-8 by adding encoding="utf-8" to read_text() and write_text().

Conclusion

Enforcing UTF-8 by adding encoding="utf-8" to every read_text() and write_text() call is not necessarily the cleanest approach and does have some drawbacks. However, at this point, it appears to be the fastest practical way to resolve the related issues.

For that reason, the code is being updated in this way for now, with the understanding that this approach may be revisited and changed at any time through further discussion.

@memtomem
Copy link
Copy Markdown
Owner

Welcome @powerzist, and thanks for the contribution!

On the encoding="utf-8" changes — these are great. Python's Path.write_text/read_text default to locale.getencoding(), which is typically cp1252 on Windows (or cp949 on Korean Windows, etc.), so non-ASCII content can round-trip incorrectly. Making the encoding explicit is the right call for every line you touched. Please keep all of these.

On the mkdir(exist_ok=True) changes — most should be reverted.

The pytest tmp_path fixture creates a fresh directory per test function — it's guaranteed empty, so the directories you're creating under it can't already exist. Adding exist_ok=True to those calls is defensive-coding noise that obscures the intent: a future reader will think "this directory might already exist? from where?" and waste time chasing a non-existent race.

Concretely, every mkdir(...) call in this PR is under tmp_path (or a fixture-created derivative like claude_dir, codex_home, fake_home that's itself rooted in tmp_path). For example:

  • conftest.py:57(tmp_path / "memories").mkdir(exist_ok=True) — fresh dir, no collision possible
  • test_cli_ingest.py:148, 158, 166, 176, ... — all tmp_path-rooted
  • test_context.py, test_context_agents.py, ... — same pattern throughout

Could you revert all the mkdir changes? If any specific call actually fails on Windows under tmp_path, please flag the exact file:line and share the traceback — that'd be a genuinely interesting case and we'd want to understand the root cause rather than just add exist_ok.

(The two pre-existing mkdir(parents=True, exist_ok=True) calls you didn't touch — session_cmd.py:27 production code and the root.mkdir helpers — are legitimate; those paths have real reuse semantics.)

A couple of smaller asks:

  1. Flesh out the PR body — 21 files / 415 lines with a 2-line description makes the PR hard to justify 6 months from now when someone reads git blame. At minimum:

    • What problem did you hit? ("On Windows, test_X fails with UnicodeDecodeError: ... because ..." — a traceback from your local Windows setup would be ideal, even just the exception type.)
    • Why is utf-8 the right encoding (not ascii, utf-16, etc.)?
    • Confirmation that you ran the test suite locally on Windows post-change.
  2. Title tweak — "fix test bugs" doesn't describe what's changing. Suggestion: test: make file I/O encoding-explicit for Windows compatibility (or similar). Drops the ambiguous "fix test bugs" framing.

Merge-order note: this PR touches session_cmd.py alongside #146 (state-dir refactor) and #157 (activity-hook logger). Different lines, so no conflicts, but #146 is likely to merge first (it's an open design discussion), after which you'd need a one-line rebase. No action required now.

Once the mkdir revert + body/title updates land, I'll do another pass. The encoding="utf-8" work itself is solid — this is just about tightening the scope.

@powerzist
Copy link
Copy Markdown
Collaborator Author

powerzist commented Apr 16, 2026

On Windows, I encountered the following error where a directory could not be created because it already exists:

packages\memtomem\tests\test_cli_ingest.py:551:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = WindowsPath('C:/Users/user/AppData/Local/Temp/pytest-of-user/pytest-30/test_happy_path_indexes_codex_0/memories')
mode = 511, parents = False, exist_ok = False

    def mkdir(self, mode=0o777, parents=False, exist_ok=False):
        """
        Create a new directory at this given path.
        """
        try:
>           os.mkdir(self, mode)
E           FileExistsError: [WinError 183] 파일이 이미 있으므로 만들 수 없습니다: 'C:\\Users\\user\\AppData\\Local\\Temp\\pytest-of-user\\pytest-30\\test_happy_path_indexes_codex_0\\memories'

..\..\AppData\Roaming\uv\python\cpython-3.13.5-windows-x86_64-none\Lib\pathlib\_local.py:722: FileExistsError


packages\memtomem\tests\test_cli_ingest.py:581:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = WindowsPath('C:/Users/user/AppData/Local/Temp/pytest-of-user/pytest-30/test_rerun_skips_unchanged1/memories')
mode = 511, parents = False, exist_ok = False

    def mkdir(self, mode=0o777, parents=False, exist_ok=False):
        """
        Create a new directory at this given path.
        """
        try:
>           os.mkdir(self, mode)
E           FileExistsError: [WinError 183] 파일이 이미 있으므로 만들 수 없습니다: 'C:\\Users\\user\\AppData\\Local\\Temp\\pytest-of-user\\pytest-30\\test_rerun_skips_unchanged1\\memories'

..\..\AppData\Roaming\uv\python\cpython-3.13.5-windows-x86_64-none\Lib\pathlib\_local.py:722: FileExistsError

There may be multiple possible causes for this issue, but based on my observations, I would like to share one possible explanation.

It seems that the temporary directory system creates up to three directories, and then starts reusing them. When reusing a directory, it attempts to delete the existing contents before creating it again. However, on Windows, some files inside the directory may still be locked, which can cause the deletion step to fail. As a result, the directory remains in place, and a FileExistsError is raised when attempting to recreate it.

In my environment, I consistently observed that only up to three directories were present under the temporary path, which appears to align with this reuse behavior.

Although macOS would also raise an error if a directory already exists, I understand that this issue may not occur there because the cleanup from previous runs completes successfully. I have not personally verified this on macOS, so this part is based on assumption.

After applying a fix to handle this situation more safely on Windows, I was able to confirm that the issue no longer occurs in my tests.

To help prevent similar issues that may arise in the future, I reviewed other parts of the code with similar patterns and added the exist_ok=True argument where applicable.

Copy link
Copy Markdown
Owner

@memtomem memtomem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution!

The source change in session_cmd.py (explicit encoding="utf-8") is valid for Windows compatibility — happy to merge that part.

However, the bulk test changes (20 files) adding encoding="utf-8" to every write_text/read_text and exist_ok=True to mkdir calls are unnecessary:

  • exist_ok=True on tmp_path: pytest's tmp_path is guaranteed fresh per test — the directory never pre-exists, so exist_ok adds no value.
  • encoding="utf-8" on test ASCII strings: These tests write ASCII-only strings like "hello world" or "b" — encoding is irrelevant and just adds noise.

Could you reduce the scope to only the source file change (session_cmd.py)? That keeps the PR focused and avoids touching every test file for no functional benefit.

@memtomem
Copy link
Copy Markdown
Owner

Thanks @powerzist — the traceback is genuinely helpful and I owe you a correction. My earlier "tmp_path is always fresh per test" framing was too absolute: pytest's default tmp_path_retention_count=3 does reuse directory roots across sessions, and if the previous session's rmtree cleanup fails, the next run sees the leftover directory. On Linux/macOS rmtree generally succeeds even while files are still being written; on Windows it blocks on any open handle. So the symptom you're reporting is real.

That said, I still think the root cause is upstream of mkdir. The components fixture (conftest.py:41-71) tears down via close_components(), which only closes storage, embedder, search_pipeline, and llm — if any of those retain an OS-level handle after await …close() returns (SQLite WAL/SHM sidecars, ONNX Runtime's mmap of the model file, etc.), pytest's subsequent cleanup of that tmp dir fails on Windows, and the leftover memories/ is the downstream consequence. Suppressing with blanket exist_ok=True hides the leak; the leftover directory may also carry over a stale test.db/test.db-wal from the previous run, which could cause quieter flakes later that are much harder to bisect.

Two concrete asks:

  1. Scope split. Could we make this PR just the encoding="utf-8" changes? Those are unambiguously right on every line they touch and I'll merge quickly. Revert the mkdir(exist_ok=True) additions here (plus the ones that aren't handle-related at all — _derive_slug / _discover_claude_slug_dirs / etc. tests don't open DB connections, so those definitely can't be hitting your race).

  2. Follow-up for the actual handle leak. Would you be willing to open a separate issue titled something like "Windows: test components fixture leaves SQLite/embedder handles open, blocking pytest cleanup" with your traceback and the "up to 3 reused dirs" observation? That triage is the valuable part of what you've discovered — a fix probably lives inside close_components() or one of the .close() methods it calls, not in per-site exist_ok=True. I'll open it myself and credit your findings if you'd rather not.

If the underlying handle leak turns out to be genuinely unfixable at the source (e.g. a third-party library holding a C-level lock we can't release), site-targeted exist_ok=True on the exact paths that demonstrate the race (conftest.py:48, test_cli_ingest.py:551, test_cli_ingest.py:581) would be a reasonable fallback — but scoped to those three spots with a comment pointing at the upstream cause, not applied across sites that don't hold handles.

Happy to iterate on whichever direction is lighter on your side.

@powerzist
Copy link
Copy Markdown
Collaborator Author

I agree with your point. However, I have been away since yesterday, so I may need a little time to address it.
If you could assist by opening an issue, I would greatly appreciate it.
I will handle the remaining work as soon as possible once I return.

@memtomem
Copy link
Copy Markdown
Owner

Opened #206 for the handle-leak investigation as discussed — credited your traceback and the 3-dir reuse observation. No rush on this PR; take your time when you're back. The scope split (encoding-only) and a quick rebase on top of any intervening commits is all this PR needs.

@powerzist
Copy link
Copy Markdown
Collaborator Author

Thank you again for your thoughtful suggestion, and I apologize for the delayed response as I have just returned and things have been quite busy on my end.

Regarding the encoding issue, I understand and respect your point that explicitly specifying UTF-8 may be unnecessary when the content is purely ASCII. I agree that, in such cases, leaving it unspecified can be a reasonable choice.

That said, my concern is that trying to distinguish between ASCII-only cases and all other cases may introduce additional complexity and maintenance overhead without much practical benefit. In other words, the cost of handling those cases differently may be greater than the cost of consistently specifying UTF-8.

For example, in cases like the following:

self = WindowsPath('C:/Users/user/AppData/Local/Temp/pytest-of-user/pytest-30/test_edited_file_is_reindexed_0/fake_home/.claude/projects/-Users-test-Work-demo-project/memory/feedback_a.md')
data = '# Feedback A\n\nRevised guidance: prefer bge-m3 for multilingual corpora — the Korean coverage is measurably better than bge-small, and the English quality is comparable.\n'
encoding = 'locale', errors = None, newline = None

    def write_text(self, data, encoding=None, errors=None, newline=None):
        """
        Open the file in text mode, write to it, and close the file.
        """
        if not isinstance(data, str):
            raise TypeError('data must be str, not %s' %
                            data.__class__.__name__)
        with self.open(mode='w', encoding=encoding, errors=errors, newline=newline) as f:
>           return f.write(data)
                   ^^^^^^^^^^^^^
E           UnicodeEncodeError: 'cp949' codec can't encode character '\u2014' in position 73: illegal multibyte sequence

..\..\AppData\Roaming\uv\python\cpython-3.13.5-windows-x86_64-none\Lib\pathlib\_abc.py:652: UnicodeEncodeError

the error is explicitly shown as a UnicodeEncodeError, so the root cause is relatively clear.

There are also cases like this:

self = <tests.test_e2e_pipeline.TestCrossLanguageE2E object at 0x0000021546BC6490>
components = Components(config=Mem2MemConfig(embedding=EmbeddingConfig(provider='none', model='bge-m3', dimension=1024, base_url=''... 0x0000021546BCFA50>, search_pipeline=<memtomem.search.pipeline.SearchPipeline object at 0x00000215477C2B30>, llm=None)
memory_dir = WindowsPath('C:/Users/user/AppData/Local/Temp/pytest-of-user/pytest-30/test_mixed_language_context0/memories')

    async def test_mixed_language_context(self, components, memory_dir):
        """Mixed-language doc with context expansion preserves both languages."""
        doc = memory_dir / "mixed.md"
        doc.write_text(
            "## Project Overview\n\n"
            "Building a knowledge management platform.\n\n"
            "## 기술 스택\n\n"
            "Python 3.12, FastAPI, SQLite with FTS5.\n\n"
            "## Architecture\n\n"
            "Monorepo with uv workspace. Two packages.\n"
        )
        await components.index_engine.index_file(doc)

        results, _ = await components.search_pipeline.search("기술 스택", top_k=1, context_window=1)
>       assert len(results) >= 1
E       assert 0 >= 1
E        +  where 0 = len([])

packages\memtomem\tests\test_e2e_pipeline.py:185: AssertionError

this appears at first to be a search failure, but the underlying cause is still the encoding issue. In cases like this, identifying the true cause requires additional investigation, and explicitly using UTF-8 also resolves the problem.

The examples above are all from the test code. I understand that you suggested narrowing the scope to session_cmd.py, but in practice the same kind of change seems to be needed in the tests as well.

Since UTF-8 is backward-compatible with ASCII, it works for ASCII content without requiring separate handling. Because of that, my impression is that the downside of explicitly specifying UTF-8 even for ASCII-only content is smaller than the downside of splitting the logic between ASCII and non-ASCII cases.

Of course, I do not believe there is a single absolute right answer here, and I fully respect your judgment on the direction we should take. Before I proceed with the changes, I just wanted to share my reasoning in a bit more detail and kindly ask whether, after considering the points above, you would still prefer that I keep the scope limited. I will be happy to follow your decision either way, but I would sincerely appreciate it if you could take one more look at my perspective.

@memtomem
Copy link
Copy Markdown
Owner

You're right, and thanks for pushing back with concrete tracebacks — both examples are convincing and I'm withdrawing my earlier scope-narrowing ask.

The second one in particular (test_mixed_language_context silently returning 0 results) is the decisive case. A UnicodeEncodeError at least points at its own root cause; a search-result assertion failure whose actual cause is cp949-dropped non-ASCII bytes is exactly the kind of issue that eats hours to bisect on Windows and looks like a ranking/embedding bug on the way there. The "ASCII-only strings don't need encoding" argument falls apart the moment a test fixture grows a Korean section or an em-dash — which, as your examples show, several of ours already have and more will in the future.

The "consistent encoding="utf-8" everywhere" rule is also the right one to hold in code review going forward: it's mechanical to enforce, it can't regress silently, and the cost is a few extra characters per call site. Trying to maintain an ASCII-vs-non-ASCII split-by-inspection is exactly the kind of rule that decays.

So: please keep all the encoding="utf-8" changes in this PR. Sorry for the back-and-forth on that.

For the mkdir(exist_ok=True) changes — those are a separate concern and we've already split them off into #206, so let's drop them from this PR and track the handle-leak investigation there. When you're ready:

  1. Revert only the mkdir(exist_ok=True) additions (keep every encoding="utf-8").
  2. A one-liner PR body update noting the two failure modes you hit on Windows (UnicodeEncodeError in write_text, and the silent mixed-language search failure) would help future git blame readers understand the scope.
  3. Rebase on main if needed — nothing has landed that should conflict.

No rush. Appreciate you taking the time to make the case clearly.

@powerzist powerzist changed the title fix: improve Windows compatibility and fix test bugs fix: improve Windows compatibility Apr 19, 2026
@powerzist powerzist changed the title fix: improve Windows compatibility fix: improve Windows compatibility - enforce encoding UTF-8 Apr 19, 2026
@powerzist powerzist changed the title fix: improve Windows compatibility - enforce encoding UTF-8 fix: improve Windows compatibility - enforce UTF-8 encoding Apr 19, 2026
@powerzist powerzist changed the title fix: improve Windows compatibility - enforce UTF-8 encoding fix: improve Windows compatibility — enforce UTF-8 encoding Apr 19, 2026
@memtomem
Copy link
Copy Markdown
Owner

Thanks @powerzist — the scope split looks great (encoding-only, mkdir reverts confirmed) and the PR body now captures both Windows failure modes clearly.

Two small things left before merge:

  1. Rebase on main — there are conflicts now (a few test files have moved since you branched). A rebase + force-push to your develop branch should resolve them.
  2. CI re-trigger — as a side benefit, the rebase will also kick CI for the new HEAD. The current SHA b33e7df5 ended up with only the CLA workflow run (CI itself didn't trigger this time, looks like a one-off — earlier pushes on this branch ran CI fine), so a fresh push is the cleanest way to get a green run.

I'll re-review and merge once it's green. Thanks again for sticking with this through the back-and-forth.

- enforce UTF-8 encoding
@memtomem
Copy link
Copy Markdown
Owner

Hey @powerzist — looks like the rebase landed but with unresolved conflict markers (the <<<<<<< / ======= / >>>>>>> lines) still in the source. CI is failing on syntax errors as a result. Two files affected: packages/memtomem/src/memtomem/cli/session_cmd.py and packages/memtomem/tests/test_cli.py (5 unresolved blocks total).

What happened: between when you branched and now, PR #146 refactored _STATE_FILE (a constant) into a _state_file() function in session_cmd.py. So the conflict was real — you need to combine both changes: keep main's new _state_file() call and add your encoding="utf-8". For example, line 30 should resolve to:

text = _state_file().read_text(encoding="utf-8").strip()

The same kind of resolution applies to the other markers — for each <<<<<<< HEAD ... ======= ... >>>>>>> block, take the new shape from HEAD (main) and apply your encoding/etc. change on top.

Recommended workflow:

git rebase --abort                    # if rebase still in progress
git fetch upstream main               # or origin if upstream is set there
git rebase upstream/main
# for each conflicted file: open it, find <<<<<<<, manually merge
# then:
git add <files>
git rebase --continue
# before push, sanity-check no markers leaked:
grep -rn "<<<<<<< HEAD" packages/
uv run ruff check packages/memtomem/src packages/memtomem/tests
git push --force-with-lease origin develop

The grep step is the cheap insurance against this repeating. Sorry for the rough edge here — let me know if anything in the resolution is unclear.

@powerzist
Copy link
Copy Markdown
Collaborator Author

I’m sorry — that was my mistake. I’ve updated the five conflict-marker blocks you pointed out to use the latest changes.

@memtomem
Copy link
Copy Markdown
Owner

Almost there — the 17 test files look great and CI is fully green. One thing got lost in the conflict resolution though: the encoding="utf-8" additions in packages/memtomem/src/memtomem/cli/session_cmd.py dropped out. That was the production-code change in this PR (the test changes are important too, but session_cmd is the one that actually affects users on Windows).

Current state on main (after PR #146's refactor):

# line 30
text = _state_file().read_text().strip()
# line 38
_state_file().write_text(session_id + "\n")

Both need encoding="utf-8" added:

text = _state_file().read_text(encoding="utf-8").strip()
_state_file().write_text(session_id + "\n", encoding="utf-8")

Once that's pushed, I'll merge. Sorry for the extra round-trip — this one's subtle because the file no longer shows up in the diff at all, so it's easy to miss.

@powerzist
Copy link
Copy Markdown
Collaborator Author

I’m sorry about that. I’ve added the missing parts. I hope there are no further issues now.

Copy link
Copy Markdown
Owner

@memtomem memtomem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sticking with this through multiple rounds — the final diff is exactly right. Merging now.

@memtomem memtomem merged commit 920b363 into memtomem:main Apr 19, 2026
7 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 19, 2026
@powerzist powerzist deleted the develop branch April 21, 2026 04:40
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants