feat(converter): align whitespace handling with LibreOffice HTML output #91

Merged: JSv4 merged 2 commits into main from feature/nested-list-styles on Dec 23, 2025

JSv4 (Owner) commented Dec 23, 2025

Summary

  • Remove white-space: pre-wrap from default CSS to match LibreOffice behavior
  • Add ConvertTextWithNbsp() to convert significant whitespace to &nbsp; entities
  • Add NormalizeInlineWhitespace() to remove whitespace text nodes between inline elements
  • Add ToHtmlString() helper method for proper HTML serialization

Problem

Whitespace between inline elements in the generated HTML was rendering as visible spaces, producing output like "FIRST : The name..." instead of "FIRST: The name...". This follows from standard HTML whitespace handling, in which newlines and indentation between inline elements render as a visible space.

Solution

Multi-pronged approach aligned with LibreOffice's HTML output:

  1. Removed white-space: pre-wrap CSS that was preserving all whitespace
  2. Added &nbsp; conversion for significant whitespace (leading/trailing/multiple spaces)
  3. Added inline whitespace normalization to remove whitespace text nodes between inline elements
  4. Added ToHtmlString() post-processor to strip formatting whitespace from serialized output
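
For illustration only, the nbsp conversion in step 2 might look roughly like the Python sketch below. This is not the C# ConvertTextWithNbsp() from this PR. The function name, the regular expression, and the choice to replace every space in a qualifying run are assumptions, and the no-break space is emitted as the U+00A0 character rather than as an entity.

```python
import re

NBSP = "\u00a0"  # no-break space character

# Hypothetical rule set: a leading run of spaces, a trailing run of spaces,
# or any interior run of two or more spaces counts as "significant".
_SIGNIFICANT_SPACES = re.compile(r"^ +| +$| {2,}")


def convert_text_with_nbsp(text: str) -> str:
    """Replace significant space runs with the same number of no-break spaces."""
    return _SIGNIFICANT_SPACES.sub(lambda m: NBSP * len(m.group(0)), text)
```

A single space between words is left alone, so normal line wrapping still applies once white-space: pre-wrap is gone.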

Test plan

  • Build passes
  • Manual conversion of NVCA-Model-COI-10-1-2025.docx shows correct whitespace rendering
  • Verify existing HTML converter tests pass

JSv4 added 2 commits December 22, 2025 22:29
- Remove white-space: pre-wrap from default CSS to match LibreOffice behavior
- Add ConvertTextWithNbsp() to convert significant whitespace to &nbsp; entities
  (multiple consecutive spaces, leading/trailing spaces in text runs)
- Add NormalizeInlineWhitespace() to remove whitespace text nodes between
  inline elements in the XElement tree
- Add ToHtmlString() helper method for proper HTML serialization that removes
  formatting whitespace between inline elements while preserving indentation

This ensures that HTML formatting (newlines/indentation) doesn't create visible
spaces between adjacent spans/anchors in the rendered output, matching how
LibreOffice generates HTML from DOCX documents.
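
As a rough illustration of the tree-level normalization described in the commit message above, here is a Python sketch using xml.etree.ElementTree. It is not the C# NormalizeInlineWhitespace() that operates on the XElement tree. The inline tag set and the function name are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Assumed set of inline tags; the real converter may use a different list.
INLINE_TAGS = {"a", "span", "b", "i", "em", "strong", "u", "sub", "sup"}

# Only formatting whitespace is stripped. U+00A0 is deliberately excluded,
# because str.strip() with no arguments would also remove no-break spaces.
FORMATTING_WHITESPACE = " \t\r\n"


def normalize_inline_whitespace(root: ET.Element) -> None:
    """Drop whitespace-only text sitting between two adjacent inline elements."""
    for parent in root.iter():
        children = list(parent)
        for current, following in zip(children, children[1:]):
            between = current.tail  # text between `current` and `following`
            if (
                between is not None
                and between.strip(FORMATTING_WHITESPACE) == ""
                and current.tag in INLINE_TAGS
                and following.tag in INLINE_TAGS
            ):
                current.tail = None
```

With this in place, `<span>FIRST</span>`, a newline plus indentation, and `<span>: The name</span>` serialize with nothing between the two spans, while a no-break space between them is kept.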
Use \u00A0 character instead of &nbsp; entity to ensure HTML output
can be parsed as valid XML by tests.
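
The reason for this switch can be reproduced with any strict XML parser: &nbsp; is an HTML named entity that XML does not predefine, so a document containing it fails to parse unless a DTD declares it, while the literal U+00A0 character is ordinary text. The sketch below uses Python's xml.etree.ElementTree as a stand-in for the XML parsing done in the project's tests. The helper name and sample markup are made up for the example.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup: str) -> bool:
    """Return True if the markup is well-formed XML for a strict parser."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Same visible content, written with the HTML entity and with the raw character.
with_entity = "<p>FIRST:&nbsp;The name</p>"
with_character = "<p>FIRST:\u00a0The name</p>"
```

Here parses_as_xml(with_entity) is False because the entity is undefined, and parses_as_xml(with_character) is True.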
@JSv4 JSv4 merged commit e448ece into main Dec 23, 2025
6 checks passed
@JSv4 JSv4 deleted the feature/nested-list-styles branch December 23, 2025 03:46