Skip to content

feat: add create_file_from_elements() to re-create document files from elements#4259

Merged
PastelStorm merged 5 commits intoUnstructured-IO:mainfrom
claytonlin1110:feat/create-file-from-elements
Feb 24, 2026
Merged

feat: add create_file_from_elements() to re-create document files from elements#4259
PastelStorm merged 5 commits intoUnstructured-IO:mainfrom
claytonlin1110:feat/create-file-from-elements

Conversation

@claytonlin1110
Copy link
Copy Markdown
Contributor

@claytonlin1110 claytonlin1110 commented Feb 23, 2026

Summary

Adds create_file_from_elements() in unstructured.staging.base so users can re-build a document file from a list of elements (reverse of partition). Supports the workflow: partition -> modify elements (e.g. replace Image with NarrativeText using alt text) -> write back to file.

Closes #3994.

Changes

  • unstructured/staging/base.py: New create_file_from_elements(elements, format="markdown"|"html"|"text", filename=None, ...) that delegates to elements_to_md, elements_to_html, or elements_to_text and optionally writes to a file.
  • test_unstructured/staging/test_base.py: Tests for markdown, text, and HTML output and for unsupported format raising ValueError.

@claytonlin1110
Copy link
Copy Markdown
Contributor Author

@PastelStorm Would you please review this? Thanks.

@claytonlin1110 claytonlin1110 force-pushed the feat/create-file-from-elements branch from 2961ec2 to f05e045 Compare February 23, 2026 20:42
@claytonlin1110
Copy link
Copy Markdown
Contributor Author

@PastelStorm have a chance to review?

@PastelStorm
Copy link
Copy Markdown
Contributor

@PastelStorm have a chance to review?

Review: create_file_from_elements()

Nice feature — having a unified "elements back to document" API is a good addition. A few things to address before merging:


High — HTML mode silently drops content with default settings

create_file_from_elements() defaults no_group_by_page=False, but the underlying HTML converter (_elements_to_html_tags_by_page) skips any element that lacks metadata.page_number:

# unstructured/partition/html/convert.py, lines 294-299
for element in unstructured_elements:
    page_number = element.metadata.page_number
    if page_number is None:
        logger.warning(f"Page number is not set for an element {element.id}. Skipping.")
        continue
    pages_dict[page_number].append(element)

For many partition sources (e.g. non-paginated documents, or elements that have been manually constructed/modified), this produces nearly empty HTML while the API promises to "re-create a document file." The new test sidesteps this entirely by passing no_group_by_page=True.

Suggestion: Default no_group_by_page=True in create_file_from_elements(), since a round-trip function should preserve all elements by default. If page grouping is desired, callers can opt in explicitly.


Medium — assert used for runtime type narrowing

content = elements_to_text(elements, filename=None, encoding=encoding)
assert content is not None  # we passed filename=None

assert statements are stripped under python -O. In optimized mode this silently lets None through despite the -> str return type. Either add a proper runtime guard:

if content is None:
    raise RuntimeError("elements_to_text returned None unexpectedly")

Or skip the Optional[str] fight altogether and call convert_to_text(elements) directly, which always returns str.


Medium — Test coverage gaps

  • The HTML test forces no_group_by_page=True, so the default (and riskier) code path is never exercised. A test with the default settings on elements that lack page_number would catch the silent-drop behavior above.
  • The text and html tests don't exercise the filename write path. Since the markdown test does, this looks like an oversight — one parameterized test across all three formats would cover it cleanly.
  • No coverage for exclude_binary_image_data passthrough or format case/whitespace normalization (e.g. " Markdown ""markdown").

Low — Format-specific params as top-level arguments

no_group_by_page only applies to HTML. exclude_binary_image_data is irrelevant for text. Surfacing every format-specific knob at the top level won't scale if more options are added. Consider **kwargs pass-through to the underlying converter, or at minimum document which params apply to which formats.


Low — format shadows the built-in

The parameter name format shadows Python's built-in format(). Not a bug here, but easy to avoid (e.g. output_format) and removes a common lint warning.


Low — Write-path duplication

create_file_from_elements() re-implements file writing (open(...).write(...)) rather than delegating to the format-specific functions that already support filename=. This is fine today, but increases drift risk if encoding or output semantics evolve in the individual converters.


What I checked

Area Result
Duplicated code Minor (write-path), not severe
Dead code None in this diff
Incorrect mock usage No mocking introduced
Race conditions None — straightforward single-file writes
Useless tests Not useless, but the HTML test bypasses the critical default path

Summary

The main thing to fix is the HTML default — a function called "create file from elements" shouldn't silently discard elements out of the box. The assert should also be replaced with a proper guard. The rest are smaller improvements that would be nice to clean up while you're in here.

@claytonlin1110
Copy link
Copy Markdown
Contributor Author

@PastelStorm Fixed. please review again.

@claytonlin1110 claytonlin1110 force-pushed the feat/create-file-from-elements branch from 2bea3f3 to e4401a0 Compare February 24, 2026 04:00
@claytonlin1110
Copy link
Copy Markdown
Contributor Author

@PastelStorm Approve required? :)

@PastelStorm PastelStorm added this pull request to the merge queue Feb 24, 2026
Merged via the queue into Unstructured-IO:main with commit d0f8620 Feb 24, 2026
53 of 54 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat/create-file-from-elements

2 participants