fix: suggest repair-encoding when invalid string encoding is encountered #97
Closed
liuxiaopai-ai wants to merge 1 commit into wesm:main from
Security Review: No High/Medium Issues Found
Claude's automated security review did not identify any high or medium severity security concerns in this PR. Note: This is an automated review and should not replace human security review, especially for changes involving:
Powered by Claude 4.5 Sonnet
Owner
Thank you. I am fixing the underlying problem in #98 so that calling
Owner
Closing since this is superseded by #98 (I cherry-picked your commit). Thank you for the contribution!
wesm added a commit that referenced this pull request Feb 7, 2026
## Summary

- Validate participant address names and attachment filenames with `EnsureUTF8()` before database insertion during sync, closing a gap where subject/body/snippet were already validated but address names and filenames were not
- The root cause: emails with mis-labeled RFC 2047 headers (e.g. `=?UTF-8?Q?Jane_Doe=C9ric?=` where the bytes are actually Latin-1) pass through enmime's address parser with invalid UTF-8, get inserted into the `participants` table as-is, then cause DuckDB errors when exported to Parquet
- When invalid string encoding is encountered in query results, show a hint suggesting `msgvault repair-encoding` to fix existing data (cherry-picked from #97)

Fixes #95. Supersedes #97

## Test plan

- [x] `TestFullSync_Latin1InFromName` - Latin-1 É in From name via mis-labeled RFC 2047
- [x] `TestFullSync_InvalidUTF8InAllAddressFields` - Windows-1252 smart quotes in From/To/Cc/Bcc
- [x] `TestFullSync_InvalidUTF8InAttachmentFilename` - attachment filename validation
- [x] `TestFullSync_MultipleEncodingIssuesSameMessage` - mixed Latin-1 + Windows-1252 in one email
- [x] `TestEncodingErrorHint` - repair-encoding hint on DuckDB encoding errors
- [x] Full test suite passes
- [x] Linter passes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
Co-authored-by: liuxiaopai-ai <[email protected]>
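To make the root cause concrete, here is a minimal Go sketch of the failure and of one possible repair. It is not the project's `EnsureUTF8()`: the helper name `ensureUTF8Sketch` is hypothetical, and the fallback it uses (treat every byte as Latin-1) is a simplifying assumption. The input is the byte sequence that the mis-labeled header `=?UTF-8?Q?Jane_Doe=C9ric?=` decodes to.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// ensureUTF8Sketch is a hypothetical stand-in, NOT msgvault's EnsureUTF8().
// It returns s unchanged when s is already valid UTF-8. Otherwise it assumes
// the bytes are Latin-1 (ISO-8859-1), where each byte value equals the
// Unicode code point, and re-encodes them as UTF-8.
func ensureUTF8Sketch(s string) string {
	if utf8.ValidString(s) {
		return s
	}
	runes := make([]rune, 0, len(s))
	for i := 0; i < len(s); i++ {
		runes = append(runes, rune(s[i])) // byte 0xC9 becomes U+00C9 (É)
	}
	return string(runes)
}

func main() {
	// Bytes produced by Q-decoding "Jane_Doe=C9ric": 0xC9 is a lone Latin-1 É.
	raw := "Jane Doe\xC9ric"
	fmt.Println(utf8.ValidString(raw)) // false

	fixed := ensureUTF8Sketch(raw)
	fmt.Println(utf8.ValidString(fixed), fixed) // true Jane DoeÉric
}
```

A pure Latin-1 fallback would map Windows-1252 smart quotes (bytes 0x93 and 0x94) to control characters, so a real implementation would likely need charset detection or a Windows-1252 table for the cases covered by the test plan above.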
wesm added a commit to robelkin/msgvault that referenced this pull request Feb 7, 2026

…esm#98)

## Summary

- Validate participant address names and attachment filenames with `EnsureUTF8()` before database insertion during sync, closing a gap where subject/body/snippet were already validated but address names and filenames were not
- The root cause: emails with mis-labeled RFC 2047 headers (e.g. `=?UTF-8?Q?Jane_Doe=C9ric?=` where the bytes are actually Latin-1) pass through enmime's address parser with invalid UTF-8, get inserted into the `participants` table as-is, then cause DuckDB errors when exported to Parquet
- When invalid string encoding is encountered in query results, show a hint suggesting `msgvault repair-encoding` to fix existing data (cherry-picked from wesm#97)

Fixes wesm#95. Supersedes wesm#97

## Test plan

- [x] `TestFullSync_Latin1InFromName` - Latin-1 É in From name via mis-labeled RFC 2047
- [x] `TestFullSync_InvalidUTF8InAllAddressFields` - Windows-1252 smart quotes in From/To/Cc/Bcc
- [x] `TestFullSync_InvalidUTF8InAttachmentFilename` - attachment filename validation
- [x] `TestFullSync_MultipleEncodingIssuesSameMessage` - mixed Latin-1 + Windows-1252 in one email
- [x] `TestEncodingErrorHint` - repair-encoding hint on DuckDB encoding errors
- [x] Full test suite passes
- [x] Linter passes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
Co-authored-by: liuxiaopai-ai <[email protected]>
Summary
When users encounter an `Invalid string encoding found in Parquet file` error in the TUI or CLI commands, they currently see an opaque error with no guidance on how to fix it. This PR adds a user-friendly hint suggesting `msgvault repair-encoding` whenever this specific error is detected.

Closes #95
Changes
- `internal/query/encoding_hint.go`: New helper functions `IsEncodingError()` and `HintRepairEncoding()` that detect the DuckDB encoding error string and wrap it with the hint.
- `internal/tui/model.go`: Applied `HintRepairEncoding()` to all five TUI error handlers (aggregate data, message list, message detail, thread messages, search results) so the hint appears inline in the TUI error display.
- CLI commands: Applied `HintRepairEncoding()` to error paths in the `search`, `list-senders`, `list-domains`, and `list-labels` commands.
- `internal/query/encoding_hint_test.go`: Unit tests covering nil errors, unrelated errors, direct encoding errors, and wrapped encoding errors, verifying both detection and hint injection with proper error chain preservation.

Design
The approach uses string matching on the error message (`"Invalid string encoding found in Parquet file"`) since this error originates from the DuckDB C library at runtime and is not available as a typed error. The hint is appended via `fmt.Errorf("%w\nHint: ...")` to preserve the original error chain for `errors.Is`/`errors.As` compatibility.
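The design above can be sketched in a few lines of Go. This is an illustration under assumptions, not the code from this PR: the function names and the matched substring come from the description, but the signatures, the exact hint wording, and the sample errors (`errSampleEncoding`, `errSampleUnrelated`) are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// encodingErrorText is the substring quoted in the PR description.
const encodingErrorText = "Invalid string encoding found in Parquet file"

// Hypothetical stand-ins for errors coming back from the query layer.
var (
	errSampleEncoding  = errors.New(encodingErrorText)
	errSampleUnrelated = errors.New("connection refused")
)

// IsEncodingError reports whether err's message contains the DuckDB encoding
// error text. Errors wrapped with %w include the inner message in Error(),
// so they match as well. (Assumed signature.)
func IsEncodingError(err error) bool {
	return err != nil && strings.Contains(err.Error(), encodingErrorText)
}

// HintRepairEncoding returns err unchanged unless it is an encoding error,
// in which case it wraps err with a hint. The hint wording here is
// illustrative only. (Assumed signature.)
func HintRepairEncoding(err error) error {
	if !IsEncodingError(err) {
		return err
	}
	// %w keeps the original error reachable through errors.Is / errors.As.
	return fmt.Errorf("%w\nHint: run 'msgvault repair-encoding' to fix existing data", err)
}

func main() {
	wrapped := fmt.Errorf("load messages: %w", errSampleEncoding)
	hinted := HintRepairEncoding(wrapped)

	fmt.Println(hinted)
	fmt.Println(errors.Is(hinted, errSampleEncoding))                         // true
	fmt.Println(HintRepairEncoding(errSampleUnrelated) == errSampleUnrelated) // true
}
```

Because the detection is a substring match, it would silently stop working if DuckDB changed the message text, which is one reason the fix in #98 targets the invalid data at insertion time instead.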