Fix annotation projection on sanitized HTML fragments #111

Merged
JSv4 merged 1 commit into main from claude/investigate-issue-110-2xxWx
Mar 21, 2026

@JSv4 JSv4 commented Mar 21, 2026

Summary

Fixes annotation projection methods to handle HTML fragments with multiple root elements and HTML named entities, which are common in sanitized HTML output from libraries like DOMPurify.

Key Changes

  • HTML Fragment Parsing: Added ParseHtmlString() method that:

    • Replaces HTML named entities (&nbsp;, &ndash;, etc.) with numeric character references so the input is valid XML
    • Wraps multi-root HTML in a synthetic <docxodus-fragment-root> element to satisfy XElement.Parse() requirements
    • Falls back to wrapping only when single-root parsing fails
  • HTML Serialization: Added SerializeHtmlString() method that:

    • Removes the synthetic wrapper when serializing back to string
    • Preserves original HTML structure for single-root documents
  • Updated Methods: Modified three public methods to use the new parsing/serialization:

    • ProjectAnnotationsOntoHtml()
    • AddAnnotationToHtml()
    • RemoveAnnotationFromHtml()
  • Comprehensive Tests: Added 6 new test cases covering:

    • Multiple root elements handling
    • HTML named entities (generic and &nbsp; specifically)
    • Backward compatibility with single-root HTML
    • All three affected methods
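
The wrap-on-failure parsing and unwrap-on-serialize behavior listed above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the class and method names (`FragmentSketch`, `Parse`, `Serialize`) are hypothetical, the actual signatures of `ParseHtmlString()` and `SerializeHtmlString()` are not shown in this description, and only the wrapper element name is taken from it.

```csharp
using System;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

// Hypothetical sketch; not the PR's implementation.
public static class FragmentSketch
{
    // Wrapper element name as given in the PR description.
    private const string WrapperName = "docxodus-fragment-root";

    // Try single-root parsing first; wrap only if that throws.
    public static XElement Parse(string html)
    {
        try
        {
            return XElement.Parse(html, LoadOptions.PreserveWhitespace);
        }
        catch (XmlException)
        {
            // Multi-root (or text-only) input: retry inside a synthetic root.
            // If the input is malformed for another reason, this call throws again.
            return XElement.Parse(
                "<" + WrapperName + ">" + html + "</" + WrapperName + ">",
                LoadOptions.PreserveWhitespace);
        }
    }

    // Serialize the element itself, or only its children if it is the synthetic wrapper.
    public static string Serialize(XElement root)
    {
        if (root.Name.LocalName != WrapperName)
            return root.ToString(SaveOptions.DisableFormatting);

        return string.Concat(
            root.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));
    }
}
```

Under these assumptions, `FragmentSketch.Serialize(FragmentSketch.Parse("<p>a</p><p>b</p>"))` returns the input unchanged, and a single-root fragment such as `<div><p>a</p></div>` never receives the wrapper.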

Implementation Details

  • HTML entity mapping includes 25 common entities (nbsp, ndash, mdash, quotes, currency symbols, etc.)
  • Synthetic wrapper is only added when necessary (multi-root detection via exception handling)
  • No changes to annotation logic itself—purely input/output handling improvements
  • Maintains backward compatibility with well-formed single-root HTML documents
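
For illustration, the entity replacement step might look something like the sketch below. The four-entry table is an assumed subset chosen for the example; the PR's actual 25-entry mapping is not reproduced in this description, and `EntitySketch` and `ToNumeric` are hypothetical names. Entities not in the table, including XML's predefined ones such as `&amp;`, are left untouched.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sketch; the table is an illustrative subset, not the PR's mapping.
public static class EntitySketch
{
    private static readonly Dictionary<string, string> Numeric =
        new Dictionary<string, string>
        {
            { "nbsp", "&#160;" },   // U+00A0 no-break space
            { "ndash", "&#8211;" }, // U+2013 en dash
            { "mdash", "&#8212;" }, // U+2014 em dash
            { "euro", "&#8364;" },  // U+20AC euro sign
        };

    // Matches named references like &nbsp; but not numeric ones like &#160;.
    private static readonly Regex NamedEntity =
        new Regex("&([A-Za-z][A-Za-z0-9]*);");

    // Replace known HTML named entities with numeric character references.
    public static string ToNumeric(string html)
    {
        return NamedEntity.Replace(html, m =>
            Numeric.TryGetValue(m.Groups[1].Value, out var replacement)
                ? replacement
                : m.Value);
    }
}
```

With this table, `EntitySketch.ToNumeric("a&nbsp;b &amp; c&ndash;d")` yields `a&#160;b &amp; c&#8211;d`, which an XML parser accepts without a DTD.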

https://claude.ai/code/session_01B3Yy9hY9o3iUBf5KD8wXUP

…ts (#110)

XElement.Parse() requires valid XML with a single root element, but
sanitized HTML (e.g., DOMPurify output) strips <html>/<body> wrappers,
leaving multiple top-level elements. Also handle HTML named entities
(&nbsp;, &ndash;, etc.) that are invalid in XML.
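
Both failure modes can be reproduced directly against `XElement.Parse()`. A minimal sketch, where `FailureDemo` and `ThrowsXmlException` are hypothetical names used only for this example:

```csharp
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper for demonstrating the parse failures.
public static class FailureDemo
{
    // Returns true when XElement.Parse rejects the input with an XmlException.
    public static bool ThrowsXmlException(string input)
    {
        try
        {
            XElement.Parse(input);
            return false;
        }
        catch (XmlException)
        {
            return true;
        }
    }
}
```

Here `ThrowsXmlException("<p>one</p><p>two</p>")` and `ThrowsXmlException("<p>a&nbsp;b</p>")` both return true, while `ThrowsXmlException("<p>a&#160;b</p>")` returns false.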

https://claude.ai/code/session_01B3Yy9hY9o3iUBf5KD8wXUP
@JSv4 JSv4 merged commit ecc8771 into main Mar 21, 2026
10 checks passed
@JSv4 JSv4 deleted the claude/investigate-issue-110-2xxWx branch March 21, 2026 00:22

Development

Successfully merging this pull request may close these issues.

ProjectAnnotationsOntoHtml fails with Xml_MultipleRoots when HTML has multiple top-level elements
