Fix invalid UTF-8 during sync and suggest repair-encoding on error #98

Merged: wesm merged 4 commits into main from fix/utf8-participant-names on Feb 7, 2026

wesm (Owner) commented Feb 7, 2026

Summary

  • Validate participant address names and attachment filenames with EnsureUTF8() before database insertion during sync, closing a gap where subject/body/snippet were already validated but address names and filenames were not
  • The root cause: emails with mis-labeled RFC 2047 headers (e.g. =?UTF-8?Q?Jane_Doe=C9ric?= where the bytes are actually Latin-1) pass through enmime's address parser with invalid UTF-8, get inserted into the participants table as-is, then cause DuckDB errors when exported to Parquet
  • When invalid string encoding is encountered in query results, show a hint suggesting msgvault repair-encoding to fix existing data (cherry-picked from #97, "fix: suggest repair-encoding when invalid string encoding is encountered")
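
To illustrate the failure mode described above, here is a minimal sketch of a validate-then-repair step. `ensureUTF8Sketch` is a hypothetical stand-in, not msgvault's actual `EnsureUTF8()`, and the Latin-1 fallback is an assumption about one reasonable repair strategy.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// ensureUTF8Sketch is a hypothetical stand-in for EnsureUTF8.
// Valid UTF-8 is returned unchanged. Otherwise every byte is
// reinterpreted as an ISO-8859-1 (Latin-1) code point, which always
// yields valid UTF-8. Caveat: a string mixing valid multi-byte UTF-8
// with stray bytes would have its valid parts mangled by this fallback.
func ensureUTF8Sketch(s string) string {
	if utf8.ValidString(s) {
		return s
	}
	runes := make([]rune, 0, len(s))
	for i := 0; i < len(s); i++ {
		runes = append(runes, rune(s[i]))
	}
	return string(runes)
}

func main() {
	// Bytes a mis-labeled header can decode to: 0xC9 is "É" in Latin-1,
	// but in UTF-8 it is a lead byte with no continuation byte after it.
	raw := "Jane Doe\xC9ric"
	fmt.Println(utf8.ValidString(raw)) // false

	fixed := ensureUTF8Sketch(raw)
	fmt.Println(utf8.ValidString(fixed), fixed) // true Jane DoeÉric
}
```

Running a check like this before insertion means the `participants` table only ever receives valid UTF-8, so the Parquet export never sees the bad bytes.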

Fixes #95. Supersedes #97.

Test plan

  • TestFullSync_Latin1InFromName - Latin-1 É in From name via mis-labeled RFC 2047
  • TestFullSync_InvalidUTF8InAllAddressFields - Windows-1252 smart quotes in From/To/Cc/Bcc
  • TestFullSync_InvalidUTF8InAttachmentFilename - attachment filename validation
  • TestFullSync_MultipleEncodingIssuesSameMessage - mixed Latin-1 + Windows-1252 in one email
  • TestEncodingErrorHint - repair-encoding hint on DuckDB encoding errors
  • Full test suite passes
  • Linter passes
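
The behavior covered by TestEncodingErrorHint could look roughly like the sketch below. The function names, the hint wording, and the matched substring are all assumptions for illustration. The real DuckDB error text and msgvault's matching logic may differ.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// encodingHint is hypothetical wording, not the message msgvault prints.
const encodingHint = "hint: run 'msgvault repair-encoding' to fix invalid text in existing data"

// needsEncodingHint reports whether an error message looks like an
// encoding failure. The matched substring is an assumption.
func needsEncodingHint(msg string) bool {
	return strings.Contains(strings.ToLower(msg), "invalid string encoding")
}

// withEncodingHint appends the hint to encoding errors and leaves all
// other errors (and nil) untouched. The original error stays wrapped,
// so errors.Is and errors.As still work on the result.
func withEncodingHint(err error) error {
	if err == nil || !needsEncodingHint(err.Error()) {
		return err
	}
	return fmt.Errorf("%w\n%s", err, encodingHint)
}

func main() {
	// Illustrative error text only.
	err := withEncodingHint(errors.New("query failed: Invalid string encoding found"))
	fmt.Println(err)
}
```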

🤖 Generated with Claude Code

wesm and others added 2 commits February 7, 2026 06:44
Participant names from MIME headers were inserted into the participants
table without UTF-8 validation, while other text fields (subject, body,
snippet) were already validated. This caused DuckDB errors when
exporting to Parquet: mis-labeled RFC 2047 headers (e.g. claiming UTF-8
but containing Latin-1 bytes) would produce invalid strings that passed
through enmime's address parser unchecked.

Fixes #95

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…ing tests

Extend UTF-8 validation to attachment filenames and content types
before database insertion (defense-in-depth, since enmime already
sanitizes filenames). Add three new test cases covering:

- Windows-1252 smart quotes in To/Cc/Bcc names (all address fields)
- Attachment filename validation
- Multiple encoding issues in a single message (Latin-1 + Windows-1252)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
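
As a rough sketch of what covering all address fields and attachment filenames means in practice, the example below sanitizes every name on a message before it would be stored. The `address` and `message` types are hypothetical, and `strings.ToValidUTF8` is used as a simple lossy stand-in for `EnsureUTF8()`: it replaces invalid bytes with U+FFFD instead of recovering the intended characters.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// address and message are hypothetical shapes, not msgvault's types.
type address struct {
	Name  string
	Email string
}

type message struct {
	From, To, Cc, Bcc []address
	AttachmentNames   []string
}

// clean replaces each run of invalid UTF-8 bytes with U+FFFD.
// This is a lossy stand-in for EnsureUTF8.
func clean(s string) string {
	return strings.ToValidUTF8(s, "\uFFFD")
}

// sanitize cleans every display name and attachment filename in place.
func sanitize(m *message) {
	for _, list := range [][]address{m.From, m.To, m.Cc, m.Bcc} {
		for i := range list {
			list[i].Name = clean(list[i].Name)
		}
	}
	for i := range m.AttachmentNames {
		m.AttachmentNames[i] = clean(m.AttachmentNames[i])
	}
}

func main() {
	m := message{
		// 0xC9 is Latin-1 "É"; 0x93 and 0x94 are Windows-1252 smart quotes.
		From:            []address{{Name: "Jane Doe\xC9ric", Email: "jane@example.com"}},
		Cc:              []address{{Name: "\x93Bob\x94", Email: "bob@example.com"}},
		AttachmentNames: []string{"r\xE9sum\xE9.pdf"},
	}
	sanitize(&m)
	fmt.Println(
		utf8.ValidString(m.From[0].Name),
		utf8.ValidString(m.Cc[0].Name),
		utf8.ValidString(m.AttachmentNames[0]),
	) // true true true
}
```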
wesm changed the title from "Fix invalid UTF-8 in participant display names during sync" to "Fix invalid UTF-8 during sync and suggest repair-encoding on error" on Feb 7, 2026
- Rewrite TestFullSync_InvalidUTF8InAttachmentFilename to use raw MIME
  with Latin-1 bytes in the filename, asserting the sanitized output
- Add ORDER BY a.id to InspectAttachment for deterministic results
- Return both filename and mime_type from InspectAttachment to verify
  content-type handling in the same test
- Document that enmime sanitizes filenames (our EnsureUTF8 is
  defense-in-depth) and always strips content-type to the base MIME type

Co-Authored-By: Claude Opus 4.6 <[email protected]>
wesm merged commit 9b9734f into main on Feb 7, 2026
3 checks passed
wesm added a commit to robelkin/msgvault that referenced this pull request Feb 7, 2026
…esm#98)

Co-authored-by: Claude Opus 4.6 <[email protected]>
Co-authored-by: liuxiaopai-ai <[email protected]>
Linked issue: Repair hint when encountering invalid string encoding