fix: improve Windows compatibility — enforce UTF-8 encoding #162
memtomem merged 4 commits into memtomem:main from powerzist:develop
Conversation
|
Welcome @powerzist, and thanks for the contribution! On the The pytest Concretely, every
Could you revert all the (The two pre-existing A couple of smaller asks:
Merge-order note: this PR touches Once the
|
On Windows, I encountered an error where a directory could not be created because it already exists (a FileExistsError).

There may be multiple possible causes for this issue, but based on my observations, I would like to share one possible explanation. It seems that the temporary directory system creates up to three directories, and then starts reusing them. When reusing a directory, it attempts to delete the existing contents before creating it again. However, on Windows, some files inside the directory may still be locked, which can cause the deletion step to fail. As a result, the directory remains in place, and a FileExistsError is raised when attempting to recreate it. In my environment, I consistently observed that only up to three directories were present under the temporary path, which appears to align with this reuse behavior.

Although macOS would also raise an error if a directory already exists, I understand that this issue may not occur there because the cleanup from previous runs completes successfully. I have not personally verified this on macOS, so this part is based on assumption.

After applying a fix to handle this situation more safely on Windows, I was able to confirm that the issue no longer occurs in my tests. To help prevent similar issues that may arise in the future, I reviewed other parts of the code with similar patterns and added the same handling there as well.
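The reuse-and-failed-cleanup sequence described above can be sketched in a self-contained way. The paths and names below are illustrative stand-ins, not memtomem's actual code, and the failed cleanup is simulated rather than relying on a real Windows file lock:

```python
import shutil
import tempfile
from pathlib import Path

base = Path(tempfile.mkdtemp())   # stand-in for the reused temp root
target = base / "run0"

# First run: create the directory and write into it.
target.mkdir()
(target / "data.txt").write_text("hello", encoding="utf-8")

# Later run: cleanup is attempted before recreating. On Windows a locked
# file can make rmtree fail, leaving the directory behind; here we simulate
# that failed cleanup by simply skipping the rmtree step.
cleanup_failed = True
if not cleanup_failed:
    shutil.rmtree(target)

# A bare mkdir() now raises FileExistsError, matching the reported error;
# exist_ok=True tolerates the leftover directory and lets the run proceed.
try:
    target.mkdir()
except FileExistsError:
    print("bare mkdir raised FileExistsError")
target.mkdir(exist_ok=True)  # succeeds despite the leftover directory

shutil.rmtree(base)  # demo cleanup
```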
memtomem
left a comment
Thanks for the contribution!
The source change in session_cmd.py (explicit encoding="utf-8") is valid for Windows compatibility — happy to merge that part.
However, the bulk test changes (20 files) adding encoding="utf-8" to every write_text/read_text and exist_ok=True to mkdir calls are unnecessary:
- `exist_ok=True` on `tmp_path`: pytest's `tmp_path` is guaranteed fresh per test — the directory never pre-exists, so `exist_ok` adds no value.
- `encoding="utf-8"` on test ASCII strings: these tests write ASCII-only strings like `"hello world"` or `"b"` — encoding is irrelevant and just adds noise.
Could you reduce the scope to only the source file change (session_cmd.py)? That keeps the PR focused and avoids touching every test file for no functional benefit.
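The per-test freshness contract referred to above can be illustrated with a minimal sketch. The test name is hypothetical, and outside a pytest run the fixture is emulated by handing the function a brand-new temporary directory:

```python
import tempfile
from pathlib import Path

def test_writes_file(tmp_path: Path) -> None:
    # tmp_path is unique and empty for each test, so a bare mkdir is safe:
    # the directory can never pre-exist, and exist_ok=True adds nothing here.
    sub = tmp_path / "sub"
    sub.mkdir()
    (sub / "a.txt").write_text("hello world")
    assert (sub / "a.txt").read_text() == "hello world"

# Emulate pytest's contract: give the test a fresh, empty directory.
with tempfile.TemporaryDirectory() as d:
    test_writes_file(Path(d))
```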
|
Thanks @powerzist — the traceback is genuinely helpful and I owe you a correction. My earlier "tmp_path is always fresh per test" framing was too absolute: pytest's default cleanup keeps only the three most recent base temp directories and prunes older ones on later runs, which matches the three-directory reuse you observed. That said, I still think the root cause is upstream of the failing mkdir: something on Windows is keeping files in those directories locked, which is what makes the pruning fail in the first place. Two concrete asks:
If the underlying handle leak turns out to be genuinely unfixable at the source (e.g. a third-party library holding a C-level lock we can't release), site-targeted `exist_ok=True` at the affected call sites would be a reasonable fallback. Happy to iterate on whichever direction is lighter on your side.
|
I agree with your point. However, I have been away since yesterday, so I may need a little time to address it.
|
Opened #206 for the handle-leak investigation as discussed — credited your traceback and the 3-dir reuse observation. No rush on this PR; take your time when you're back. The scope split (encoding-only) and a quick rebase on top of any intervening commits is all this PR needs.
|
Thank you again for your thoughtful suggestion, and I apologize for the delayed response; I have just returned and things have been quite busy on my end.

Regarding the encoding issue, I understand and respect your point that explicitly specifying UTF-8 may be unnecessary when the content is purely ASCII. I agree that, in such cases, leaving it unspecified can be a reasonable choice. That said, my concern is that trying to distinguish between ASCII-only cases and all other cases may introduce additional complexity and maintenance overhead without much practical benefit. In other words, the cost of handling those cases differently may be greater than the cost of consistently specifying UTF-8.

For example, in some of the failures I hit, the error is explicitly shown as an encoding error. There are also cases where what appears at first to be a search failure has the encoding issue as its underlying cause. In cases like the latter, identifying the true cause requires additional investigation, and explicitly using UTF-8 also resolves the problem. The examples above are all from the test code.

I understand that you suggested narrowing the scope to `session_cmd.py`. Since UTF-8 is backward-compatible with ASCII, it works for ASCII content without requiring separate handling. Because of that, my impression is that the downside of explicitly specifying UTF-8 even for ASCII-only content is smaller than the downside of splitting the logic between ASCII and non-ASCII cases.

Of course, I do not believe there is a single absolute right answer here, and I fully respect your judgment on the direction we should take. Before I proceed with the changes, I just wanted to share my reasoning in a bit more detail and kindly ask whether, after considering the points above, you would still prefer that I keep the scope limited. I will be happy to follow your decision either way, but I would sincerely appreciate it if you could take one more look at my perspective.
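The trade-off argued above can be made concrete with a small self-contained sketch. Here cp949 stands in for the Windows-default locale codec discussed in this thread, and the file name is illustrative:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as d:
    p = Path(d) / "note.txt"

    # For pure-ASCII content, explicit UTF-8 is a no-op at the byte level,
    # so "always pass encoding" costs nothing over leaving it unspecified.
    p.write_text("hello world", encoding="utf-8")
    assert p.read_bytes() == b"hello world"

    # For non-ASCII content, reading the UTF-8 bytes under cp949 (the
    # default codec on Korean-locale Windows) fails loudly.
    p.write_text("café ☕", encoding="utf-8")
    try:
        p.read_text(encoding="cp949")
    except UnicodeDecodeError:
        print("UnicodeDecodeError")  # the "explicit error" case above

    # An explicit, matching encoding always round-trips.
    assert p.read_text(encoding="utf-8") == "café ☕"
```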
|
You're right, and thanks for pushing back with concrete tracebacks — both examples are convincing and I'm withdrawing my earlier scope-narrowing ask. The second one in particular (the apparent search failure whose real cause is an encoding mismatch) is exactly the kind of silent, hard-to-diagnose breakage that consistency guards against. The "consistent `encoding="utf-8"` everywhere" policy is also easier to review and enforce than a split between ASCII-only and non-ASCII call sites.

So: please keep all the `encoding="utf-8"` additions, in the source and the tests, and revert only the `exist_ok=True` / mkdir changes. For the PR description, please capture both Windows failure modes (the explicit encoding error and the misleading search failure) so the motivation is clear from the body alone.

No rush. Appreciate you taking the time to make the case clearly.
|
Thanks @powerzist — the scope split looks great (encoding-only, mkdir reverts confirmed) and the PR body now captures both Windows failure modes clearly. Two small things left before merge: a rebase onto the latest main (a few commits have landed since you branched) and a green CI run.

I'll re-review and merge once it's green. Thanks again for sticking with this through the back-and-forth.
|
Hey @powerzist — looks like the rebase landed but with unresolved conflict markers (the `<<<<<<<` / `=======` / `>>>>>>>` lines) committed into the branch.

What happened: between when you branched and now, PR #146 refactored the session-state file handling (introducing the `_state_file()` helper), so your `encoding="utf-8"` change and the refactor collide. The resolved line should keep the refactored structure from main with your encoding argument re-applied:

```python
text = _state_file().read_text(encoding="utf-8").strip()
```

The same kind of resolution applies to the other markers — for each conflict block, take the structure from main and layer the explicit encoding on top.

Recommended workflow:

```shell
git rebase --abort        # if rebase still in progress
git fetch upstream main   # or origin if upstream is set there
git rebase upstream/main
# for each conflicted file: open it, find <<<<<<<, manually merge
# then:
git add <files>
git rebase --continue
# before push, sanity-check no markers leaked:
grep -rn "<<<<<<< HEAD" packages/
uv run ruff check packages/memtomem/src packages/memtomem/tests
git push --force-with-lease origin develop
```

The `--force-with-lease` flag protects you from clobbering anything that may have landed on your remote branch in the meantime.
|
I’m sorry — that was my mistake. I’ve updated the five conflict-marker blocks you pointed out to use the latest changes.
|
Almost there — the 17 test files look great and CI is fully green. One thing got lost in the conflict resolution though: the explicit encoding arguments in `session_cmd.py` itself. Current state on your branch:

```python
# line 30
text = _state_file().read_text().strip()
# line 38
_state_file().write_text(session_id + "\n")
```

Both need `encoding="utf-8"` restored:

```python
text = _state_file().read_text(encoding="utf-8").strip()
_state_file().write_text(session_id + "\n", encoding="utf-8")
```

Once that's pushed, I'll merge. Sorry for the extra round-trip — this one's subtle because the file no longer shows up in the diff at all, so it's easy to miss.
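For readers without the repo at hand, here is a hypothetical, self-contained sketch of what the fixed helpers amount to. The layout is assumed from the snippets in this thread, and the state path is invented for the demo; only the explicit encoding is the point:

```python
import tempfile
from pathlib import Path

# Invented location for the demo; memtomem's real state path differs.
_STATE_DIR = Path(tempfile.mkdtemp())

def _state_file() -> Path:
    return _STATE_DIR / "session"

def save_session(session_id: str) -> None:
    # Explicit encoding: without it, Windows falls back to the locale
    # codec (e.g. cp949) and round-trips can break.
    _state_file().write_text(session_id + "\n", encoding="utf-8")

def load_session() -> str:
    return _state_file().read_text(encoding="utf-8").strip()

save_session("abc123")
print(load_session())  # → abc123
```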
|
I’m sorry about that. I’ve added the missing parts. I hope there are no further issues now. |
memtomem
left a comment
Thanks for sticking with this through multiple rounds — the final diff is exactly right. Merging now.
Problem
Although memtomem reads and writes its files as UTF-8, Windows defaults to a locale-dependent codec (cp949 on Korean-locale Windows), which causes encoding conflicts when no encoding is passed explicitly.
Below are examples of actual errors that occurred:
This is a typical encoding conflict, and the error message clearly indicates that the issue is caused by an encoding mismatch.
In this case, the immediate cause of the error is a search failure, but the underlying cause of that failure is an encoding mismatch. Therefore, this can also ultimately be considered an encoding conflict issue.
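The second failure mode can be reproduced in isolation. In this sketch, cp949 plays the role of the locale default, `errors="replace"` mimics content that happens to decode without raising, and the file name is illustrative:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as d:
    p = Path(d) / "memo.txt"
    p.write_text("메모: café", encoding="utf-8")

    # Decoding UTF-8 bytes under the wrong codec; errors="replace" stands in
    # for content that decodes without an exception but comes out as mojibake.
    wrong = p.read_bytes().decode("cp949", errors="replace")
    right = p.read_text(encoding="utf-8")

    # The search quietly fails on the mojibake text ...
    print("café" in wrong)   # → False
    # ... and succeeds once the encoding matches.
    print("café" in right)   # → True
```

Nothing raises in the first case, so the symptom surfaces as a "not found" result rather than an encoding error, which is what makes the root cause harder to spot.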
To address the various issues caused by this and reduce the likelihood of similar problems in the future, we discussed enforcing UTF-8 by adding `encoding="utf-8"` to `read_text()` and `write_text()`.

Conclusion
Enforcing UTF-8 by adding `encoding="utf-8"` to every `read_text()` and `write_text()` call is not necessarily the cleanest approach and does have some drawbacks. However, at this point, it appears to be the fastest practical way to resolve the related issues.

For that reason, the code is being updated in this way for now, with the understanding that this approach may be revisited and changed at any time through further discussion.