ProjectAnnotationsOntoHtml fails with Xml_MultipleRoots when HTML has multiple top-level elements #110

@JSv4

Description

ProjectAnnotationsOntoHtml fails with Xml_MultipleRoots when the input HTML contains multiple top-level elements (e.g., a <style> block followed by a <div>). This happens because the function parses the HTML as XML internally, and XML requires a single root element.

Root Cause

The convertDocxToHtml() output is a complete HTML document:

<html><head><style>...docxodus formatting CSS...</style></head><body><div class="docx-000000">...</div></body></html>

In security-conscious applications, this HTML is sanitized (e.g., via DOMPurify) before being passed to ProjectAnnotationsOntoHtml. Sanitizers strip the <html>/<head>/<body> wrapper tags but preserve content, producing:

<style>.docx-Normal { font-family: 'Arial'... }</style><div class="docx-000000">...</div>

This is valid HTML but invalid XML — two root elements (<style> and <div>). The WASM ProjectAnnotationsOntoHtml function then fails:

Error: Failed to project annotations: Xml_MessageWithErrorPosition, Xml_MultipleRoots, 1, 25

The error position 1, 25 means line 1, column 25, reported as the point where the parser reaches the second root element (<div>) after the </style> closing tag. The exact column depends on the length of the preceding <style> block.
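For callers who want to log or branch on this failure, the error text can be pulled apart with a small helper. This is only a sketch: parseXmlErrorPosition and XmlErrorInfo are hypothetical names, not part of docxodus, and the pattern assumes the message keeps the "Xml_MessageWithErrorPosition, <code>, <line>, <column>" layout shown above.

```typescript
// Hypothetical shape for the parsed error; not a docxodus type.
interface XmlErrorInfo {
  code: string;   // e.g. "Xml_MultipleRoots"
  line: number;   // line number as reported in the message
  column: number; // column number as reported in the message
}

// Extracts the XML error code and position from an error message.
// Returns null when the message does not follow the assumed layout.
function parseXmlErrorPosition(message: string): XmlErrorInfo | null {
  const match =
    /Xml_MessageWithErrorPosition,\s*(Xml_\w+),\s*(\d+),\s*(\d+)/.exec(message);
  if (!match) {
    return null;
  }
  return {
    code: match[1],
    line: Number(match[2]),
    column: Number(match[3]),
  };
}
```

Passing the message of the caught Error to this helper gives a code that can be compared against "Xml_MultipleRoots" or "Xml_UndeclaredEntity" before deciding how to recover.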

Reproduction

import { initialize, convertDocxToHtml, projectAnnotationsOntoHtml } from "docxodus";

await initialize();

// docxBytes: the contents of any .docx file, loaded by the caller
const html = await convertDocxToHtml(docxBytes);

// Simulate what DOMPurify or any sanitizer does — strip document wrapper
// This produces valid HTML but with multiple top-level elements
const stripped = html
  .replace(/<\/?html[^>]*>/gi, '')
  .replace(/<\/?head[^>]*>/gi, '')
  .replace(/<\/?body[^>]*>/gi, '');

// Or more simply, any HTML with a <style> before the content div:
const multiRoot = `<style>.test { color: red; }</style><div>Hello world</div>`;

const annotationSet = {
  title: "",
  content: "Hello world",
  pageCount: 1,
  pawlsFileContent: [],
  docLabels: [],
  labelledText: [{
    id: "ann-1",
    annotationLabel: "TEST",
    rawText: "Hello",
    page: 0,
    annotationJson: { start: 0, end: 5 },
    structural: false,
  }],
  textLabels: {
    "TEST": { id: "TEST", text: "Test", color: "#FF0000", description: "", icon: "", labelType: "text" }
  },
  docLabelDefinitions: {},
  documentId: "",
  documentHash: "",
  createdAt: new Date().toISOString(),
  updatedAt: new Date().toISOString(),
  version: "1.0",
};

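// This throws: Xml_MessageWithErrorPosition, Xml_MultipleRoots, 1, 25
await projectAnnotationsOntoHtml(stripped, annotationSet);
// Passing multiRoot instead of stripped should hit the same Xml_MultipleRoots
// error (the reported position depends on the input).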

Current Workaround

Callers must wrap sanitized HTML in a <div> before passing to ProjectAnnotationsOntoHtml:

if (!sanitized.trimStart().startsWith("<div")) {
  sanitized = `<div>${sanitized}</div>`;
}

This is fragile: it relies on string prefix matching, it still fails when the input starts with a <div> but has further top-level siblings, and it adds an extra wrapper div that wasn't in the original output.
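A somewhat sturdier caller-side variant is to wrap unconditionally with a recognisable marker and remove that exact wrapper afterwards. The sketch below is an assumption-laden illustration: wrapInSingleRoot, unwrapSingleRoot and the data-projection-root attribute are hypothetical names, and it is not confirmed that ProjectAnnotationsOntoHtml serializes the wrapper back verbatim. If it does not, unwrapSingleRoot leaves the output untouched.

```typescript
// Hypothetical marker wrapper; the attribute name is an illustration only.
const WRAPPER_OPEN = '<div data-projection-root="1">';
const WRAPPER_CLOSE = "</div>";

// Always wrap, so the XML parser sees exactly one root element
// regardless of what the sanitized HTML starts with.
function wrapInSingleRoot(html: string): string {
  return `${WRAPPER_OPEN}${html}${WRAPPER_CLOSE}`;
}

// Remove the wrapper only if it comes back exactly as it was written.
function unwrapSingleRoot(html: string): string {
  const trimmed = html.trim();
  if (trimmed.startsWith(WRAPPER_OPEN) && trimmed.endsWith(WRAPPER_CLOSE)) {
    return trimmed.slice(
      WRAPPER_OPEN.length,
      trimmed.length - WRAPPER_CLOSE.length
    );
  }
  // Wrapper not found verbatim (assumption violated): return input unchanged.
  return html;
}
```

The intended use would be to call wrapInSingleRoot on the sanitized HTML before projection and unwrapSingleRoot on the result, which avoids the prefix check and keeps the extra div out of the final markup when the assumption holds.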

Suggested Fix

The WASM ProjectAnnotationsOntoHtml function could:

  1. Auto-wrap in a root element if the input has multiple top-level elements, or
  2. Use an HTML parser instead of an XML parser for the input (since the input IS HTML, not XML), or
  3. Document the single-root requirement in the TypeScript type/JSDoc for the function

Option 1 is simplest — detect multiple roots and wrap in <div> internally before XML parsing, then strip the wrapper from the output.
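The detection step in option 1 could look roughly like the following. This is a rough sketch written in TypeScript for illustration, not the library's code: countTopLevelElements is a hypothetical helper, it uses a simple regex scan rather than a real HTML parser, it ignores top-level text nodes, and it can be confused by attribute values containing ">" or by unbalanced markup.

```typescript
// Elements that never take a closing tag in HTML.
const VOID_ELEMENTS = new Set([
  "area", "base", "br", "col", "embed", "hr", "img",
  "input", "link", "meta", "source", "track", "wbr",
]);
// Elements whose content is raw text and must not be scanned for tags.
const RAW_TEXT_ELEMENTS = new Set(["script", "style"]);

// Counts elements that open at nesting depth 0.
function countTopLevelElements(html: string): number {
  // Matches either a comment or an opening/closing tag.
  const tagPattern = /<!--[\s\S]*?-->|<(\/?)([a-zA-Z][a-zA-Z0-9-]*)[^>]*?(\/?)>/g;
  let depth = 0;
  let roots = 0;
  let match: RegExpExecArray | null;
  while ((match = tagPattern.exec(html)) !== null) {
    const closing = match[1];
    const rawName = match[2];
    const selfClosing = match[3];
    if (!rawName) continue; // a comment: nothing to count
    const name = rawName.toLowerCase();
    if (closing) {
      depth = Math.max(0, depth - 1);
      continue;
    }
    if (depth === 0) roots++;
    if (VOID_ELEMENTS.has(name) || selfClosing) continue;
    if (RAW_TEXT_ELEMENTS.has(name)) {
      // Jump to the matching close tag so CSS/JS content is not scanned.
      const closePattern = new RegExp(`</${name}\\b`, "gi");
      closePattern.lastIndex = tagPattern.lastIndex;
      const close = closePattern.exec(html);
      if (!close) break; // unterminated raw-text element: stop scanning
      tagPattern.lastIndex = close.index;
    }
    depth++;
  }
  return roots;
}
```

A result greater than 1 would be the signal to wrap before XML parsing. Doing the check inside the library with its own parser would be more reliable than this string scan.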

Related: Xml_UndeclaredEntity

The same XML-not-HTML parsing issue causes Xml_UndeclaredEntity errors when the HTML contains named entities like &nbsp;, &ndash;, etc. XML only recognizes &amp;, &lt;, &gt;, &quot;, &apos;. Callers must convert all HTML named entities to numeric references (e.g., &nbsp; to &#160;) before calling ProjectAnnotationsOntoHtml.

If the function used an HTML parser, both issues would be resolved. Declaring the HTML entities in the XML DOCTYPE would, at minimum, address the entity errors, though not the multiple-roots failure.
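Until then, the caller-side entity conversion could be sketched as below. The helper name htmlEntitiesToNumeric is hypothetical and the entity table is deliberately partial: a real implementation would need the full HTML named character reference list. Any named entity not in the table is left as-is and would still trigger Xml_UndeclaredEntity.

```typescript
// Partial, illustrative table of HTML named entities and their code points.
const NAMED_ENTITY_CODE_POINTS = new Map<string, number>([
  ["nbsp", 160], ["copy", 169], ["reg", 174],
  ["ndash", 8211], ["mdash", 8212], ["hellip", 8230],
  ["lsquo", 8216], ["rsquo", 8217], ["ldquo", 8220], ["rdquo", 8221],
]);

// The five entities XML predefines; these must be left alone.
const XML_PREDEFINED = new Set(["amp", "lt", "gt", "quot", "apos"]);

// Rewrites known named entities as numeric character references.
function htmlEntitiesToNumeric(html: string): string {
  return html.replace(
    /&([a-zA-Z][a-zA-Z0-9]*);/g,
    (whole: string, name: string): string => {
      if (XML_PREDEFINED.has(name)) return whole;
      const codePoint = NAMED_ENTITY_CODE_POINTS.get(name);
      // Unknown names are returned unchanged rather than guessed at.
      return codePoint === undefined ? whole : `&#${codePoint};`;
    }
  );
}
```

Running this on the sanitized HTML before the projection call covers the common cases named above (&nbsp;, &ndash;) without touching &amp; and the other XML-safe entities.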

Stack Trace

Error: Failed to project annotations: Xml_MessageWithErrorPosition, Xml_MultipleRoots, 1, 25
    projectAnnotationsOntoHtml index.ts:1975

Internal path in docxodus index.js:

// Line ~1474
const result = exports.DocumentConverter.ProjectAnnotationsOntoHtml(
  html,                    // ← parsed as XML, fails on multiple roots
  annotationSetJson,
  projectionOptions?.cssClassPrefix ?? "ext-annot-",
  projectionOptions?.labelMode ?? AnnotationLabelMode.Above
);
if (isErrorResponse(result)) {
  const error = parseError(result);
  throw new Error(`Failed to project annotations: ${error.error}`);
}

Environment

  • docxodus (npm): 5.5.0
  • Browser: Firefox / Chromium
  • Sanitizer: DOMPurify 3.x with FORCE_BODY: true
