
fix: suggest repair-encoding when invalid string encoding is encountered #97

Closed
liuxiaopai-ai wants to merge 1 commit into wesm:main from liuxiaopai-ai:fix/encoding-error-repair-hint

Conversation

@liuxiaopai-ai
Contributor

Summary

When users hit the "Invalid string encoding found in Parquet file" error in the TUI or in CLI commands, they currently see an opaque message with no guidance on how to fix it. This PR adds a user-friendly hint suggesting msgvault repair-encoding whenever this specific error is detected.

Closes #95

Changes

  • internal/query/encoding_hint.go: New helper functions IsEncodingError() and HintRepairEncoding() that detect the DuckDB encoding error string and wrap it with the hint:

    Hint: try running 'msgvault repair-encoding' to fix encoding issues
    
  • internal/tui/model.go: Applied HintRepairEncoding() to all five TUI error handlers (aggregate data, message list, message detail, thread messages, search results) so the hint appears inline in the TUI error display.

  • CLI commands: Applied HintRepairEncoding() to error paths in search, list-senders, list-domains, and list-labels commands.

  • internal/query/encoding_hint_test.go: Unit tests covering nil errors, unrelated errors, direct encoding errors, and wrapped encoding errors — verifying both detection and hint injection with proper error chain preservation.

Design

The approach uses string matching on the error message ("Invalid string encoding found in Parquet file") since this error originates from the DuckDB C library at runtime and is not available as a typed error. The hint is appended via fmt.Errorf("%w\nHint: ...") to preserve the original error chain for errors.Is/errors.As compatibility.

@github-actions

github-actions bot commented Feb 7, 2026

Security Review: No High/Medium Issues Found

Claude's automated security review did not identify any high or medium severity security concerns in this PR.

Note: This is an automated review and should not replace human security review, especially for changes involving:

  • OAuth token handling
  • Email data access or export
  • Deletion operations (Gmail API)
  • Database queries (SQL injection surface)
  • File system operations (path traversal)
  • CGO or native dependencies

Powered by Claude 4.5 Sonnet

@wesm
Owner

wesm commented Feb 7, 2026

Thank you. I am fixing the underlying problem in #98 so that calling repair-encoding is not necessary for new data. I'll pull your change into that PR so I can review and add tests together.

@wesm
Owner

wesm commented Feb 7, 2026

Closing since this is superseded by #98 (I cherry-picked your commit). Thank you for the contribution!

@wesm wesm closed this Feb 7, 2026
wesm added a commit that referenced this pull request Feb 7, 2026
## Summary

- Validate participant address names and attachment filenames with
`EnsureUTF8()` before database insertion during sync, closing a gap
where subject/body/snippet were already validated but address names and
filenames were not
- The root cause: emails with mis-labeled RFC 2047 headers (e.g.
`=?UTF-8?Q?Jane_Doe=C9ric?=` where the bytes are actually Latin-1) pass
through enmime's address parser with invalid UTF-8, get inserted into
the `participants` table as-is, then cause DuckDB errors when exported
to Parquet
- When invalid string encoding is encountered in query results, show a
hint suggesting `msgvault repair-encoding` to fix existing data
(cherry-picked from #97)

Fixes #95. Supersedes #97 

## Test plan

- [x] `TestFullSync_Latin1InFromName` - Latin-1 É in From name via
mis-labeled RFC 2047
- [x] `TestFullSync_InvalidUTF8InAllAddressFields` - Windows-1252 smart
quotes in From/To/Cc/Bcc
- [x] `TestFullSync_InvalidUTF8InAttachmentFilename` - attachment
filename validation
- [x] `TestFullSync_MultipleEncodingIssuesSameMessage` - mixed Latin-1 +
Windows-1252 in one email
- [x] `TestEncodingErrorHint` - repair-encoding hint on DuckDB encoding
errors
- [x] Full test suite passes
- [x] Linter passes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
Co-authored-by: liuxiaopai-ai <[email protected]>
wesm added a commit to robelkin/msgvault that referenced this pull request Feb 7, 2026
…esm#98)

Development

Successfully merging this pull request may close these issues.

Repair hint when encountering invalid string encoding
