Fix invalid UTF-8 during sync and suggest repair-encoding on error #98
Merged
Participant names from MIME headers were inserted into the participants table without UTF-8 validation, while other text fields (subject, body, snippet) were already validated. This caused DuckDB errors when exporting to Parquet: mis-labeled RFC 2047 headers (e.g. claiming UTF-8 but containing Latin-1 bytes) produced invalid strings that passed through enmime's address parser unchecked.

Fixes #95

Co-Authored-By: Claude Opus 4.6 <[email protected]>
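To make the failure mode concrete, the following minimal sketch decodes the mis-labeled encoded-word quoted in this PR with Go's standard `mime.WordDecoder`. It is an illustration only: msgvault parses mail with enmime, whose internals are not shown in this thread, and `decodeHeaderWord` is a hypothetical helper name. The standard decoder copies bytes declared as UTF-8 through without validating them, so it should yield the same kind of invalid string.

```go
package main

import (
	"fmt"
	"mime"
	"unicode/utf8"
)

// decodeHeaderWord (hypothetical helper) decodes one RFC 2047 encoded-word
// with the standard library decoder and reports whether the decoded text is
// valid UTF-8.
func decodeHeaderWord(word string) (string, bool, error) {
	dec := new(mime.WordDecoder)
	decoded, err := dec.Decode(word)
	if err != nil {
		return "", false, err
	}
	return decoded, utf8.ValidString(decoded), nil
}

func main() {
	// Declared as UTF-8, but =C9 is the Latin-1 byte for É, which is not a
	// complete UTF-8 sequence on its own.
	name, ok, err := decodeHeaderWord("=?UTF-8?Q?Jane_Doe=C9ric?=")
	fmt.Printf("mis-labeled: %q valid=%v err=%v\n", name, ok, err)

	// The same character encoded correctly as UTF-8 (C3 89).
	name, ok, err = decodeHeaderWord("=?UTF-8?Q?=C3=89ric?=")
	fmt.Printf("correct:     %q valid=%v err=%v\n", name, ok, err)
}
```

A string like the first one is what needs to be caught before it reaches the database, since downstream consumers such as Parquet export expect valid UTF-8.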
…ing tests

Extend UTF-8 validation to attachment filenames and content types before database insertion (defense-in-depth, since enmime already sanitizes filenames). Add three new test cases covering:

- Windows-1252 smart quotes in To/Cc/Bcc names (all address fields)
- Attachment filename validation
- Multiple encoding issues in a single message (Latin-1 + Windows-1252)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Rewrite TestFullSync_InvalidUTF8InAttachmentFilename to use raw MIME with Latin-1 bytes in the filename, asserting the sanitized output
- Add ORDER BY a.id to InspectAttachment for deterministic results
- Return both filename and mime_type from InspectAttachment to verify content-type handling in the same test
- Document that enmime sanitizes filenames (our EnsureUTF8 is defense-in-depth) and always strips content-type to the base MIME type

Co-Authored-By: Claude Opus 4.6 <[email protected]>
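The attachment handling described in the commits above (sanitize the filename, keep only the base MIME type) can be sketched with the standard library alone. This is a hedged illustration, not msgvault's code: `attachmentMeta` and `sanitizeAttachment` are hypothetical names, and replacing invalid bytes with U+FFFD via `strings.ToValidUTF8` is an assumed strategy that may differ from what enmime or `EnsureUTF8` actually produce.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// attachmentMeta is a hypothetical stand-in for the per-attachment fields
// that would be stored in the database.
type attachmentMeta struct {
	Filename string
	MIMEType string
}

// sanitizeAttachment (hypothetical) reduces the content type to its base
// media type and guarantees both fields are valid UTF-8.
func sanitizeAttachment(filename, contentType string) attachmentMeta {
	base, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		// Fall back to the text before the first ';' if parsing fails.
		base = strings.ToLower(strings.TrimSpace(strings.SplitN(contentType, ";", 2)[0]))
	}
	return attachmentMeta{
		// Each run of invalid bytes becomes one U+FFFD replacement character.
		Filename: strings.ToValidUTF8(filename, "\uFFFD"),
		MIMEType: strings.ToValidUTF8(base, "\uFFFD"),
	}
}

func main() {
	// \xe9 is Latin-1 "é"; on its own it is not valid UTF-8.
	m := sanitizeAttachment("r\xe9sum\xe9.pdf", `Application/PDF; name="x.pdf"`)
	fmt.Printf("%q %q\n", m.Filename, m.MIMEType)
}
```

Replacement with U+FFFD loses the original character but always yields a storable string, which fits a defense-in-depth check that is expected to fire rarely.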
wesm added a commit to robelkin/msgvault that referenced this pull request on Feb 7, 2026
…esm#98)

## Summary

- Validate participant address names and attachment filenames with `EnsureUTF8()` before database insertion during sync, closing a gap where subject/body/snippet were already validated but address names and filenames were not
- The root cause: emails with mis-labeled RFC 2047 headers (e.g. `=?UTF-8?Q?Jane_Doe=C9ric?=` where the bytes are actually Latin-1) pass through enmime's address parser with invalid UTF-8, get inserted into the `participants` table as-is, then cause DuckDB errors when exported to Parquet
- When invalid string encoding is encountered in query results, show a hint suggesting `msgvault repair-encoding` to fix existing data (cherry-picked from wesm#97)

Fixes wesm#95. Supersedes wesm#97

## Test plan

- [x] `TestFullSync_Latin1InFromName` - Latin-1 É in From name via mis-labeled RFC 2047
- [x] `TestFullSync_InvalidUTF8InAllAddressFields` - Windows-1252 smart quotes in From/To/Cc/Bcc
- [x] `TestFullSync_InvalidUTF8InAttachmentFilename` - attachment filename validation
- [x] `TestFullSync_MultipleEncodingIssuesSameMessage` - mixed Latin-1 + Windows-1252 in one email
- [x] `TestEncodingErrorHint` - repair-encoding hint on DuckDB encoding errors
- [x] Full test suite passes
- [x] Linter passes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
Co-authored-by: liuxiaopai-ai <[email protected]>
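As a rough sketch of validating display names before insertion, the code below applies a sanitizer to every participant row. The real `EnsureUTF8()` is not shown in this thread, so `ensureUTF8Sketch`, `participant`, and `sanitizeParticipants` are hypothetical names, and the Latin-1 fallback is an assumption chosen for illustration. It would not map Windows-1252 smart quotes (bytes 0x91 to 0x94) to their intended characters, which would need a dedicated Windows-1252 table.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// participant is a hypothetical stand-in for a row in the participants table.
type participant struct {
	Name  string
	Email string
}

// ensureUTF8Sketch returns s unchanged when it is already valid UTF-8.
// Otherwise it reinterprets every byte as a Latin-1 code point. This
// fallback is an assumption; the project's EnsureUTF8 may work differently.
func ensureUTF8Sketch(s string) string {
	if utf8.ValidString(s) {
		return s
	}
	var b strings.Builder
	for i := 0; i < len(s); i++ {
		b.WriteRune(rune(s[i])) // Latin-1 bytes map 1:1 to U+0000..U+00FF
	}
	return b.String()
}

// sanitizeParticipants cleans the display name of every address
// (From, To, Cc, Bcc alike) before the rows would be written.
func sanitizeParticipants(ps []participant) []participant {
	out := make([]participant, len(ps))
	for i, p := range ps {
		out[i] = participant{Name: ensureUTF8Sketch(p.Name), Email: p.Email}
	}
	return out
}

func main() {
	rows := sanitizeParticipants([]participant{
		{Name: "Jane Doe\xc9ric", Email: "jane@example.com"}, // invalid UTF-8
		{Name: "Zoë", Email: "zoe@example.com"},              // already valid
	})
	for _, r := range rows {
		fmt.Printf("%q valid=%v\n", r.Name, utf8.ValidString(r.Name))
	}
}
```

Running the same function over all address fields, rather than only From, is what closes the gap the summary describes.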
Summary
- Validate participant address names and attachment filenames with `EnsureUTF8()` before database insertion during sync, closing a gap where subject/body/snippet were already validated but address names and filenames were not
- The root cause: emails with mis-labeled RFC 2047 headers (e.g. `=?UTF-8?Q?Jane_Doe=C9ric?=` where the bytes are actually Latin-1) pass through enmime's address parser with invalid UTF-8, get inserted into the `participants` table as-is, then cause DuckDB errors when exported to Parquet
- When invalid string encoding is encountered in query results, show a hint suggesting `msgvault repair-encoding` to fix existing data (cherry-picked from #97, "fix: suggest repair-encoding when invalid string encoding is encountered")

Fixes #95. Supersedes #97
Test plan
- [x] `TestFullSync_Latin1InFromName` - Latin-1 É in From name via mis-labeled RFC 2047
- [x] `TestFullSync_InvalidUTF8InAllAddressFields` - Windows-1252 smart quotes in From/To/Cc/Bcc
- [x] `TestFullSync_InvalidUTF8InAttachmentFilename` - attachment filename validation
- [x] `TestFullSync_MultipleEncodingIssuesSameMessage` - mixed Latin-1 + Windows-1252 in one email
- [x] `TestEncodingErrorHint` - repair-encoding hint on DuckDB encoding errors
- [x] Full test suite passes
- [x] Linter passes

🤖 Generated with Claude Code