feat: custom Markdown extensions for partition_md #4292

Merged
PastelStorm merged 11 commits into Unstructured-IO:main from claytonlin1110:feat/partition-md-custom-markdown-extensions
Mar 25, 2026

Conversation

@claytonlin1110
Contributor

Summary

Closes #4006

  • Adds support for custom Markdown extensions when calling partition_md, defaulting to ["tables"] for backward compatibility.
  • Invalid extensions values log a warning and fall back to ["tables"].

Motivation

Fixes incorrect parsing of fenced code blocks that contain lines starting with #. Without the fenced_code extension, those lines are treated as headings.
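
To illustrate the behavior described above, here is a minimal sketch that calls Python-Markdown directly rather than going through partition_md. The sample source string and variable names are made up for the demonstration. The expectation is that the render without fenced_code contains an `<h1>`, while the render with it keeps the line inside a code block.

```python
import markdown

# Build the fence programmatically so this example can itself sit in a fenced block.
fence = "`" * 3

# Hypothetical sample input: a fenced block whose first line starts with '#'.
source = f"{fence}\n# just a comment\nx = 1\n{fence}"

# Without fenced_code, the '#' line is picked up by the heading parser.
without_fences = markdown.markdown(source, extensions=["tables"])

# With fenced_code, the whole block is rendered as code.
with_fences = markdown.markdown(source, extensions=["tables", "fenced_code"])

print("heading without fenced_code:", "<h1>" in without_fences)
print("heading with fenced_code:", "<h1>" in with_fences)
```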

How to use

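from unstructured.partition.md import partition_md

# md holds the Markdown source to partition (sample value for illustration)
md = "# Title\n\nSome text."
elements = partition_md(text=md, extensions=["fenced_code"])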

@claytonlin1110
Contributor Author

@PastelStorm Would you please review?

@claytonlin1110
Contributor Author

@PastelStorm Had a chance to review this?

@PastelStorm
Contributor

@claytonlin1110 a few small findings; once those are fixed, this PR should be good to merge :)

Findings

  • Medium: partition_md() rejects the most useful form of Python-Markdown extensions. The new validation only accepts list[str], but Python-Markdown supports passing actual Extension instances, which is the normal way to use configured/custom extensions. On this branch, a valid call like extensions=[MyExtension(...)] would be treated as invalid, logged, and silently replaced with the default extensions, so the advertised feature does not actually support real custom extensions.
    _default_extensions = ["tables", "fenced_code"]
    extensions = kwargs.pop("extensions", _default_extensions)
    if not (isinstance(extensions, list) and all(isinstance(ext, str) for ext in extensions)):
        logging.warning(
            "Ignoring invalid 'extensions' argument (expected list of strings): %r", extensions
        )
        extensions = _default_extensions

    html = markdown.markdown(text, extensions=extensions)
  • Medium: Invalid extensions values fail open instead of fail fast. Because the extension set directly changes the parsed element structure, silently swapping bad input for ["tables", "fenced_code"] can produce the wrong partitioned output without surfacing an API error to the caller. That makes typos and unsupported values hard to diagnose, and the new test locks that behavior in rather than challenging it.
def test_partition_md_invalid_extensions_logs_and_falls_back(mocker: MockFixture):
    """Invalid `extensions` value is ignored with a warning and falls back to the default list."""
    text = "# Heading"
    logger = mocker.patch("unstructured.partition.md.logging.warning")

    elements = partition_md(text=text, extensions="not-a-list")  # type: ignore[arg-type]

    # Still parses something
    assert len(elements) > 0
    # Warning was logged
    logger.assert_called_once()
  • Low: The new tests are too weak for the feature they are adding. test_partition_md_invalid_extensions_logs_and_falls_back() only proves that a warning happened and that some element was returned; it does not verify that the default extension set was actually used. test_partition_md_custom_extensions_parameter() only covers string extension names, so it would not catch the extension-instance regression above.
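
One way the first two findings could be addressed, sketched under assumptions: accept both extension names and `markdown.extensions.Extension` instances, and raise `TypeError` on anything else instead of falling back. The names `validate_extensions`, `DEFAULT_EXTENSIONS`, and `NoopExtension` are hypothetical and not taken from this PR, and treating `None` as "use the defaults" is also an assumption.

```python
import markdown
from markdown.extensions import Extension

# Hypothetical default, mirroring the snippet quoted in the review above.
DEFAULT_EXTENSIONS = ["tables", "fenced_code"]


def validate_extensions(extensions=None):
    """Return a list of extensions, or raise instead of silently falling back."""
    if extensions is None:
        # Assumption: None means "use the defaults".
        return list(DEFAULT_EXTENSIONS)
    # A bare string is neither a list nor a tuple, so it is rejected here.
    if not isinstance(extensions, (list, tuple)):
        raise TypeError(
            "extensions must be a list of names or Extension instances, got %r"
            % (extensions,)
        )
    for ext in extensions:
        # Python-Markdown accepts extension names and Extension instances.
        if not isinstance(ext, (str, Extension)):
            raise TypeError("unsupported extension entry: %r" % (ext,))
    return list(extensions)


class NoopExtension(Extension):
    """Hypothetical extension that registers nothing; used only for the demo."""

    def extendMarkdown(self, md):
        pass


# Mixed input: one extension name and one Extension instance.
exts = validate_extensions(["tables", NoopExtension()])
html = markdown.markdown("# Heading", extensions=exts)
print(html)
```

Raising here means a mistake such as `extensions="not-a-list"` surfaces at the call site rather than as subtly different elements. A stronger test could then assert on the parsed output itself, not only on the fact that a warning was logged.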

@claytonlin1110
Contributor Author

@PastelStorm Thank you for your review.
I just updated.

@PastelStorm PastelStorm enabled auto-merge March 25, 2026 05:06
auto-merge was automatically disabled March 25, 2026 05:25

Head branch was pushed to by a user without write access

@claytonlin1110
Contributor Author

@PastelStorm I just fixed the test error.

@claytonlin1110
Contributor Author

@PastelStorm auto-merge is disabled as I just pushed the test fixes

@claytonlin1110
Contributor Author

All CI passed

@PastelStorm PastelStorm added this pull request to the merge queue Mar 25, 2026
@PastelStorm
Contributor

All CI passed

Thanks again for your contribution!

@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to a conflict with the base branch Mar 25, 2026
@claytonlin1110 claytonlin1110 force-pushed the feat/partition-md-custom-markdown-extensions branch from c250e67 to ca1faf8 March 25, 2026 17:21
@claytonlin1110
Contributor Author

@PastelStorm conflicts resolved

@PastelStorm PastelStorm added this pull request to the merge queue Mar 25, 2026
Merged via the queue into Unstructured-IO:main with commit 47f42b1 Mar 25, 2026
52 checks passed

Development

Successfully merging this pull request may close these issues.

feat/Support custom extensions for partition_md

2 participants