ProjectAnnotationsOntoHtml fails with Xml_MultipleRoots when HTML has multiple top-level elements #110
Description
ProjectAnnotationsOntoHtml fails with Xml_MultipleRoots when the input HTML contains multiple top-level elements (e.g., a <style> block followed by a <div>). This happens because the function parses the HTML as XML internally, and XML requires a single root element.
Root Cause
The convertDocxToHtml() output is a complete HTML document:
<html><head><style>...docxodus formatting CSS...</style></head><body><div class="docx-000000">...</div></body></html>
In security-conscious applications, this HTML is sanitized (e.g., via DOMPurify) before being passed to ProjectAnnotationsOntoHtml. Sanitizers strip the <html>/<head>/<body> wrapper tags but preserve content, producing:
<style>.docx-Normal { font-family: 'Arial'... }</style><div class="docx-000000">...</div>
This is valid HTML but invalid XML: it has two root elements (<style> and <div>). The WASM ProjectAnnotationsOntoHtml function then fails:
Error: Failed to project annotations: Xml_MessageWithErrorPosition, Xml_MultipleRoots, 1, 25
The error position 1, 25 points to column 25 of line 1, which is where the second root element (<div>) begins after the </style> closing tag.
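Callers that want to detect this specific failure can pull the error code and position out of the message. The sketch below assumes the message always follows the comma-separated shape of the single error shown above; `parseXmlErrorPosition` and `XmlErrorPosition` are hypothetical names, not part of docxodus.

```typescript
// Hypothetical helper: extracts the XML error code and position from a
// message shaped like the one in this report. The format is inferred from
// a single observed error and may not cover every message docxodus emits.
interface XmlErrorPosition {
  code: string;   // e.g. "Xml_MultipleRoots"
  line: number;   // line reported by the XML parser
  column: number; // column reported by the XML parser
}

function parseXmlErrorPosition(message: string): XmlErrorPosition | null {
  const m = /Xml_MessageWithErrorPosition,\s*(Xml_\w+),\s*(\d+),\s*(\d+)/.exec(message);
  if (m === null) {
    return null; // message does not match the assumed shape
  }
  const [, code = "", line = "0", column = "0"] = m;
  return { code, line: Number(line), column: Number(column) };
}
```

A caller could check `parseXmlErrorPosition(err.message)?.code === "Xml_MultipleRoots"` to decide whether the input needs a single root element.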
Reproduction
import { initialize, convertDocxToHtml, projectAnnotationsOntoHtml } from "docxodus";
await initialize();
const html = await convertDocxToHtml(docxBytes);
// Simulate what DOMPurify or any sanitizer does: strip the document wrapper.
// This produces valid HTML but with multiple top-level elements.
const stripped = html
  .replace(/<\/?html[^>]*>/gi, '')
  .replace(/<\/?head[^>]*>/gi, '')
  .replace(/<\/?body[^>]*>/gi, '');
// Or more simply, any HTML with a <style> before the content div:
const multiRoot = `<style>.test { color: red; }</style><div>Hello world</div>`;
const annotationSet = {
  title: "",
  content: "Hello world",
  pageCount: 1,
  pawlsFileContent: [],
  docLabels: [],
  labelledText: [{
    id: "ann-1",
    annotationLabel: "TEST",
    rawText: "Hello",
    page: 0,
    annotationJson: { start: 0, end: 5 },
    structural: false,
  }],
  textLabels: {
    "TEST": { id: "TEST", text: "Test", color: "#FF0000", description: "", icon: "", labelType: "text" }
  },
  docLabelDefinitions: {},
  documentId: "",
  documentHash: "",
  createdAt: new Date().toISOString(),
  updatedAt: new Date().toISOString(),
  version: "1.0",
};
// This throws: Xml_MessageWithErrorPosition, Xml_MultipleRoots, 1, 25
await projectAnnotationsOntoHtml(stripped, annotationSet);
Current Workaround
Callers must wrap sanitized HTML in a <div> before passing to ProjectAnnotationsOntoHtml:
if (!sanitized.trimStart().startsWith("<div")) {
  sanitized = `<div>${sanitized}</div>`;
}
This is fragile: it relies on string prefix matching, so input that starts with a <div> but has further top-level siblings is still left unwrapped, and it adds an extra wrapper div that wasn't in the original output.
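A somewhat less fragile caller-side variant is to wrap unconditionally in a recognizable marker element and strip it again afterwards. This is only a sketch: `wrapSingleRoot`, `unwrapSingleRoot`, and the `data-single-root` attribute are hypothetical names, and it assumes that projectAnnotationsOntoHtml resolves to an HTML string and serializes the wrapper's opening tag unchanged. If either assumption does not hold, the unwrap step returns the output untouched.

```typescript
// Hypothetical marker element; any attribute unlikely to occur in real
// content would do.
const ROOT_OPEN = '<div data-single-root="1">';
const ROOT_CLOSE = "</div>";

// Always wrap, so the XML parser sees exactly one root element regardless
// of what the sanitized HTML starts with.
function wrapSingleRoot(html: string): string {
  return ROOT_OPEN + html + ROOT_CLOSE;
}

// Strip the marker wrapper if it is still the outermost element;
// otherwise return the input unchanged.
function unwrapSingleRoot(html: string): string {
  const trimmed = html.trim();
  if (trimmed.startsWith(ROOT_OPEN) && trimmed.endsWith(ROOT_CLOSE)) {
    return trimmed.slice(ROOT_OPEN.length, trimmed.length - ROOT_CLOSE.length);
  }
  return html;
}

// Intended use (not executed here):
//   const out = await projectAnnotationsOntoHtml(wrapSingleRoot(sanitized), annotationSet);
//   const projected = unwrapSingleRoot(out);
```

Wrapping unconditionally avoids inspecting the input at all, at the cost of one extra element during projection.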
Suggested Fix
The WASM ProjectAnnotationsOntoHtml function could:
- Auto-wrap in a root element if the input has multiple top-level elements, or
- Use an HTML parser instead of an XML parser for the input (since the input IS HTML, not XML), or
- Document the single-root requirement in the TypeScript type/JSDoc for the function
The first option is the simplest: detect multiple roots, wrap the input in a <div> internally before XML parsing, then strip the wrapper from the output.
Related: Xml_UndeclaredEntity
The same XML-not-HTML parsing issue causes Xml_UndeclaredEntity errors when the HTML contains named entities such as &nbsp; or &ndash;. XML predefines only &amp;, &lt;, &gt;, &quot;, and &apos;. Callers must convert all other HTML named entities to numeric references (&nbsp; → &#160;) before calling ProjectAnnotationsOntoHtml.
If the function used an HTML parser (or at minimum declared HTML entities in the XML DOCTYPE), both issues would be resolved.
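Until then, the entity conversion can be done on the caller side. The sketch below is a minimal illustration, assuming a small hand-picked entity table; `namedEntitiesToNumeric` and `NAMED_TO_CODE_POINT` are hypothetical names, not part of docxodus, and a complete version would need the full HTML named-entity list.

```typescript
// Illustrative subset of HTML named entities and their code points.
// A real implementation would need the full HTML entity table.
const NAMED_TO_CODE_POINT = new Map<string, number>([
  ["nbsp", 160],
  ["copy", 169],
  ["reg", 174],
  ["ndash", 8211],
  ["mdash", 8212],
  ["lsquo", 8216],
  ["rsquo", 8217],
  ["ldquo", 8220],
  ["rdquo", 8221],
  ["hellip", 8230],
]);

// The five entities every XML parser recognizes; these are left as they are.
const XML_PREDEFINED = new Set(["amp", "lt", "gt", "quot", "apos"]);

function namedEntitiesToNumeric(html: string): string {
  return html.replace(/&([A-Za-z][A-Za-z0-9]*);/g, (match: string, name: string) => {
    if (XML_PREDEFINED.has(name)) {
      return match;
    }
    const codePoint = NAMED_TO_CODE_POINT.get(name);
    // Entities missing from the table are left untouched and would still
    // trigger Xml_UndeclaredEntity.
    return codePoint === undefined ? match : `&#${codePoint};`;
  });
}
```

Running this over the sanitized HTML before projection leaves the XML-predefined entities alone and rewrites only the names found in the table.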
Stack Trace
Error: Failed to project annotations: Xml_MessageWithErrorPosition, Xml_MultipleRoots, 1, 25
projectAnnotationsOntoHtml index.ts:1975
Internal path in docxodus index.js:
// Line ~1474
const result = exports.DocumentConverter.ProjectAnnotationsOntoHtml(
  html, // ← parsed as XML, fails on multiple roots
  annotationSetJson,
  projectionOptions?.cssClassPrefix ?? "ext-annot-",
  projectionOptions?.labelMode ?? AnnotationLabelMode.Above
);
if (isErrorResponse(result)) {
  const error = parseError(result);
  throw new Error(`Failed to project annotations: ${error.error}`);
}
Environment
- docxodus (npm): 5.5.0
- Browser: Firefox / Chromium
- Sanitizer: DOMPurify 3.x with FORCE_BODY: true