feat(converter): align whitespace handling with LibreOffice HTML output #91
Merged
- Remove white-space: pre-wrap from default CSS to match LibreOffice behavior
- Add ConvertTextWithNbsp() to convert significant whitespace to entities (multiple consecutive spaces, leading/trailing spaces in text runs)
- Add NormalizeInlineWhitespace() to remove whitespace text nodes between inline elements in the XElement tree
- Add ToHtmlString() helper method for proper HTML serialization that removes formatting whitespace between inline elements while preserving indentation

This ensures that HTML formatting (newlines/indentation) doesn't create visible spaces between adjacent spans/anchors in the rendered output, matching how LibreOffice generates HTML from DOCX documents.
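The two tree-level helpers named in this commit could look roughly like the sketch below. This is not the code from the PR: the class name, the list of inline element names, and the exact rule for which spaces count as "significant" are all assumptions made for illustration. The sketch emits the U+00A0 character directly rather than a named entity.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml.Linq;

public static class WhitespaceSketch
{
    // Hypothetical list of inline elements; the real converter may use a different set.
    private static readonly HashSet<string> InlineNames =
        new HashSet<string> { "span", "a", "b", "i", "em", "strong", "sub", "sup" };

    // Assumed rule: a space is "significant" when it is the first or last
    // character of the run, or when it is immediately followed by another space.
    public static string ConvertTextWithNbsp(string text)
    {
        if (string.IsNullOrEmpty(text)) return text;
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            bool isEdge = i == 0 || i == text.Length - 1;
            bool beforeSpace = i + 1 < text.Length && text[i + 1] == ' ';
            sb.Append(c == ' ' && (isEdge || beforeSpace) ? '\u00A0' : c);
        }
        return sb.ToString();
    }

    // Removes text nodes made only of formatting whitespace when they sit
    // directly between two inline elements.
    public static void NormalizeInlineWhitespace(XElement root)
    {
        var toRemove = root.DescendantNodes()
            .OfType<XText>()
            .Where(t => IsFormattingWhitespace(t.Value)
                        && IsInline(t.PreviousNode)
                        && IsInline(t.NextNode))
            .ToList(); // materialize before mutating the tree
        foreach (var t in toRemove) t.Remove();
    }

    // Deliberately not string.IsNullOrWhiteSpace: that also treats U+00A0
    // as whitespace and would delete the non-breaking spaces added above.
    private static bool IsFormattingWhitespace(string value) =>
        value.All(c => c == ' ' || c == '\t' || c == '\r' || c == '\n');

    private static bool IsInline(XNode node) =>
        node is XElement e && InlineNames.Contains(e.Name.LocalName);
}
```

Whitespace-only text nodes exist in an XElement tree only if it was built with them or parsed with LoadOptions.PreserveWhitespace, so the normalization step is a no-op on trees parsed with default options.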
Use the \u00A0 character instead of an entity so that the HTML output can be parsed as valid XML by the tests.
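A small check illustrates why this matters, assuming the entity in question is the named HTML entity &nbsp; (the commit message does not spell it out). XML predefines only five named entities, so an XML parser rejects &nbsp; unless a DTD declares it, while the literal U+00A0 character and the numeric reference &#160; are both accepted. The class and method names below are hypothetical.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class NbspParsingSketch
{
    // Returns true when the markup is well-formed XML according to XElement.Parse.
    public static bool ParsesAsXml(string markup)
    {
        try
        {
            XElement.Parse(markup);
            return true;
        }
        catch (XmlException)
        {
            // Raised, for example, on a reference to an undeclared entity.
            return false;
        }
    }
}
```

With this helper, `ParsesAsXml("<p>a&nbsp;b</p>")` is false, while `ParsesAsXml("<p>a\u00A0b</p>")` and `ParsesAsXml("<p>a&#160;b</p>")` are both true.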
Summary
- Remove white-space: pre-wrap from default CSS to match LibreOffice behavior
- Add ConvertTextWithNbsp() to convert significant whitespace to entities
- Add NormalizeInlineWhitespace() to remove whitespace text nodes between inline elements
- Add ToHtmlString() helper method for proper HTML serialization

Problem
HTML whitespace between inline elements was rendering as visible spaces, producing output like "FIRST : The name..." instead of "FIRST: The name...". The cause is standard HTML whitespace handling: newlines and indentation between inline elements collapse to a single visible space.
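One way such formatting whitespace can appear is through the default indented serialization of an XElement tree. The sketch below shows the effect on a paragraph built from two adjacent spans; whether the converter previously serialized through XElement.ToString() is an assumption, and the class and method names are hypothetical.

```csharp
using System;
using System.Xml.Linq;

public static class IndentationSketch
{
    // Two adjacent inline elements with no text node between them.
    public static XElement BuildParagraph() =>
        new XElement("p",
            new XElement("span", "FIRST"),
            new XElement("span", ": The name"));

    // Default serialization indents element-only content, which puts each
    // span on its own line; a browser collapses that line break to a space.
    public static string Indented(XElement e) => e.ToString();

    // Disabling formatting keeps the spans adjacent:
    // <p><span>FIRST</span><span>: The name</span></p>
    public static string Compact(XElement e) => e.ToString(SaveOptions.DisableFormatting);
}
```

SaveOptions.DisableFormatting drops indentation everywhere, so it is a blunter tool than a serializer that removes whitespace only between inline elements and keeps the rest of the indentation.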
Solution
Multi-pronged approach aligned with LibreOffice's HTML output:
- Remove the white-space: pre-wrap CSS that was preserving all whitespace
- Non-breaking space conversion for significant whitespace (leading/trailing/multiple spaces)
- ToHtmlString() post-processor to strip formatting whitespace from serialized output

Test plan