feat: custom fallback for language detection #4238

Merged
PastelStorm merged 9 commits into Unstructured-IO:main from
claytonlin1110:feat/language-detection-custom-fallback
Feb 23, 2026

Conversation

@claytonlin1110
Contributor

Closes #4091

Implements custom fallback for language detection so short text is not forced to English and callers can control or disable detection.

Changes:

  • language_fallback
    Optional callable used when text is short (<5 words) and ASCII. It receives the text and can return a list of ISO 639-3 codes or None to leave language unspecified. If not provided, short text still defaults to ["eng"] (backward compatible).
  • detect_languages() / apply_lang_metadata()
    New parameter language_fallback; applied in the short-text path only.
  • partition() (auto)
    New parameter language_fallback; passed through to all partitioners via the metadata decorator.
  • partition_md()
    New parameter languages so callers can pass languages=[""] to disable language detection (aligned with other partitioners).

Usage:

  • Return None for short text: partition(..., language_fallback=lambda text: None)
  • Custom short-text language: partition(..., language_fallback=my_detector)
  • Disable detection: partition_md(..., languages=[""]) or partition(..., languages=[""])
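
As a rough illustration of the usage above, a fallback callable might look like the following. This is only a sketch: `my_detector` and its keyword table are hypothetical, and the only thing taken from this PR is the contract (receive the text, return a list of ISO 639-3 codes or None).

```python
from __future__ import annotations

# Hypothetical lookup table for illustration only; a real detector would use
# whatever the caller knows about their own documents.
_KEYWORD_TO_LANG = {
    "bonjour": "fra",
    "hola": "spa",
    "hallo": "deu",
}


def my_detector(text: str) -> list[str] | None:
    """Return ISO 639-3 codes for short text, or None to leave it unspecified."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    codes = sorted({_KEYWORD_TO_LANG[w] for w in words if w in _KEYWORD_TO_LANG})
    return codes or None


# Per the description, this would be passed as:
#   partition(..., language_fallback=my_detector)
print(my_detector("Bonjour monde."))  # ['fra']
print(my_detector("OK"))  # None
```

Returning None rather than guessing keeps the language unspecified for text the callable cannot classify, which is the case the description calls out.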

@claytonlin1110
Contributor Author

@badGarnet Would you please review this? This is my first contribution here.

@claytonlin1110
Contributor Author

@badGarnet Sorry for pinging you since I'm a new contributor... Please review

@claytonlin1110
Contributor Author

@PastelStorm I'm a new contributor here and I don't know whom I need to ping to get this PR reviewed.

@PastelStorm
Contributor

> @PastelStorm I'm a new contributor here and I don't know whom I need to ping to get this PR reviewed.

Would you mind making a feature branch instead of forking the repo please? Then I will review. Thank you.

@claytonlin1110
Contributor Author

@PastelStorm Forking is the normal workflow for contributing to a public Git repo.

claytonlin1110 force-pushed the feat/language-detection-custom-fallback branch from 1d0c2bc to 8fcbf39 (February 19, 2026 10:41)
@claytonlin1110
Contributor Author

@PastelStorm Do you want to set the target branch "feature" rather than "main"?

@claytonlin1110
Contributor Author

@PastelStorm Please review as I'm an external contributor.

@PastelStorm
Contributor

> @PastelStorm Forking is the normal workflow for contributing to a public Git repo.

I apologize, you're correct here. I'll review your PR in a moment. Thank you for the contribution :)

Contributor

@PastelStorm PastelStorm left a comment

There is no integration test verifying that language_fallback flows correctly through the full partition() call chain. The parameter passes through three layers of **kwargs unpacking (partition() -> partition_md() -> partition_html() -> @apply_metadata -> apply_lang_metadata() -> detect_languages()). If any intermediate function changes its signature to capture this kwarg explicitly, the feature silently breaks. An integration test like partition(filename=..., language_fallback=lambda t: None) asserting element.metadata.languages is None would guard against this.
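
To illustrate the risk, here is a toy model of the pass-through. It is not the library's code: every function below is a hypothetical stand-in that only mimics forwarding a kwarg through several layers, and the assertions play the role of the suggested integration test.

```python
from __future__ import annotations

from typing import Any, Callable, Optional

Fallback = Callable[[str], Optional[list]]


def detect_languages_stub(text: str, language_fallback: Fallback | None = None):
    """Stand-in for the innermost layer: short ASCII text goes to the fallback."""
    if len(text.split()) < 5 and text.isascii():
        if language_fallback is not None:
            return language_fallback(text)
        return ["eng"]
    return ["eng"]  # placeholder; real detection is out of scope for this sketch


def inner_partitioner(text: str, **kwargs: Any):
    """Stand-in for an inner partitioner that forwards unknown kwargs untouched."""
    return detect_languages_stub(text, **kwargs)


def broken_inner_partitioner(text: str, language_fallback=None, **kwargs: Any):
    """Same layer after a refactor that captures the kwarg but forgets to forward it."""
    return detect_languages_stub(text, **kwargs)


def outer_partition(text: str, inner=inner_partitioner, **kwargs: Any):
    """Stand-in for the top-level entry point."""
    return inner(text, **kwargs)


# End-to-end check in the spirit of the suggested integration test.
assert outer_partition("Hello world.", language_fallback=lambda t: None) is None

# With the broken layer the callback is silently dropped and "eng" comes back.
assert outer_partition(
    "Hello world.", inner=broken_inner_partitioner, language_fallback=lambda t: None
) == ["eng"]
```

A unit test on the innermost function alone would pass in both cases, which is why the end-to-end assertion is the one that catches the regression.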

if language_fallback is not None:
    result = language_fallback(text)
    return result
logger.debug(f'short text: "{text}". Defaulting to English.')
Contributor

Two issues here:

  1. Unnecessary temp variable. result = language_fallback(text); return result can be simplified to return language_fallback(text).

  2. No validation of the fallback return value. The normal detection path validates language codes through _get_iso639_language_object(), but the fallback bypasses all validation. A buggy callback returning ["not_a_language"], [42], or "" would produce invalid metadata that passes silently. Consider at minimum a type/value check, or document explicitly that the caller is responsible for returning valid ISO 639-3 codes.
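
One possible shape for such a check is sketched below. `validate_fallback_result` is a hypothetical helper, and the three-lowercase-letter test is only a cheap structural check assumed for illustration. It does not confirm that a code is actually assigned in ISO 639-3; that would need the library's own lookup.

```python
from __future__ import annotations

import re

# Structural shape of an ISO 639-3 code: exactly three lowercase ASCII letters.
_ISO_639_3_SHAPE = re.compile(r"[a-z]{3}")


def validate_fallback_result(result: object) -> list[str] | None:
    """Accept None or a list of code-shaped strings; reject everything else."""
    if result is None:
        return None
    if not isinstance(result, list):
        raise TypeError(
            f"language_fallback must return a list or None, got {type(result).__name__}"
        )
    for code in result:
        if not isinstance(code, str) or not _ISO_639_3_SHAPE.fullmatch(code):
            raise ValueError(f"invalid language code from language_fallback: {code!r}")
    return result
```

With this helper, the three examples from the comment are rejected: `["not_a_language"]` and `[42]` raise ValueError, and `""` raises TypeError.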

"""Detect language and apply it to metadata.languages for each element in `elements`.
If languages is None, default to auto detection.
If languages is and empty string, skip."""
If languages is and empty string, skip.
Contributor

typo: and should be an


partitioning_kwargs = copy.deepcopy(kwargs)
partitioning_kwargs["detect_language_per_element"] = detect_language_per_element
partitioning_kwargs["language_fallback"] = language_fallback
Contributor

language_fallback is added to partitioning_kwargs (used for "all other file types"), but it is not passed to the PDF/image code paths (lines 215-253). This means partition(filename="doc.pdf", language_fallback=my_fn) silently ignores the callback. If intentional, the docstring for language_fallback (lines 102-105) should mention this limitation.

Optional callable for short text (e.g. when detection defaults to English).
Called with the text; return a list of ISO 639-3 codes or None to leave
language unspecified.
pdf_infer_table_structure
Contributor

The language_fallback docstring is indented as a sub-parameter of languages, but it's actually an independent top-level parameter. This is a pre-existing style issue (from detect_language_per_element) that this PR perpetuates.

url: str | None = None,
metadata_filename: str | None = None,
metadata_last_modified: str | None = None,
languages: Optional[list[str]] = None,
Contributor

Inconsistent parameter exposure. languages is surfaced explicitly, but detect_language_per_element and language_fallback (which also flow through **kwargs to partition_html -> @apply_metadata) remain implicit. If the goal is to make language-related parameters discoverable, they should be treated consistently.

Comment on lines +208 to +214
def test_detect_languages_short_text_fallback_returns_custom():
    """Short ASCII text with language_fallback returns custom language."""
    result = detect_languages(
        text="Bonjour monde.",
        language_fallback=lambda t: ["fra"],
    )
    assert result == ["fra"]
Contributor

Slightly misleading test. "Bonjour monde." is pure ASCII and has < 5 words, so the fallback is invoked purely based on length/charset, not because the text is French. The hardcoded ["fra"] return is what's being asserted. The test would behave identically with text="Hello world.". Consider using neutral text or adding a clarifying comment.
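
The point can be checked with a small approximation of the short-text condition described in this PR (fewer than 5 words and ASCII). `takes_short_text_path` is a hypothetical helper, and counting words by splitting on whitespace is an assumption; the library may count differently.

```python
def takes_short_text_path(text: str) -> bool:
    """Approximate the condition under which the fallback would be consulted."""
    return len(text.split()) < 5 and text.isascii()


# Both samples qualify, so the fallback is consulted regardless of the language.
for sample in ("Bonjour monde.", "Hello world."):
    print(sample, takes_short_text_path(sample))  # True for both

# Short but non-ASCII text does not take this path.
print(takes_short_text_path("Привет мир."))  # False
```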

claytonlin1110 force-pushed the feat/language-detection-custom-fallback branch from 267533b to 9da0547 (February 19, 2026 19:39)
@claytonlin1110
Contributor Author

@PastelStorm Fixed everything.

@PastelStorm
Contributor

@claytonlin1110 you need to bump the project version and update the changelog file.

claytonlin1110 force-pushed the feat/language-detection-custom-fallback branch 2 times, most recently from adecfb8 to 7507698 (February 19, 2026 21:42)
@claytonlin1110
Contributor Author

> @claytonlin1110 you need to bump the project version and update the changelog file.

Just added and pushed

claytonlin1110 force-pushed the feat/language-detection-custom-fallback branch from 1ce845c to cac7c0f (February 19, 2026 22:08)
@claytonlin1110
Contributor Author

@PastelStorm Feel free to review

@claytonlin1110
Contributor Author

@PastelStorm Would you help me fix the lint issue? It would be great if you could point me to where to see the lint error and how to fix it.

@PastelStorm
Contributor

> @PastelStorm Would you help me fix the lint issue? It would be great if you could point me to where to see the lint error and how to fix it.

Take a look inside the Makefile; you will find all sorts of commands there that will help you debug your code locally. For example, to run a lint check you can execute either make check or make check ruff from the project root.

@claytonlin1110
Contributor Author

@PastelStorm Done

@PastelStorm
Contributor

> @PastelStorm Done

Running the CI, hold on :)

@PastelStorm
Contributor

@claytonlin1110 you might want to run the tests locally so we have less of this back-and-forth. These tests, for example, will fail locally too:
https://github.com/Unstructured-IO/unstructured/actions/runs/22203791837/job/64224646667?pr=4238

@claytonlin1110
Contributor Author

> @claytonlin1110 you might want to run the tests locally so we have less of this back-and-forth. These tests, for example, will fail locally too: https://github.com/Unstructured-IO/unstructured/actions/runs/22203791837/job/64224646667?pr=4238

By make test?

@claytonlin1110
Contributor Author

@PastelStorm Would you help fix the test?

@PastelStorm
Contributor

> @PastelStorm Would you help fix the test?

I would encourage you to do the research and fix it yourself. This way you will also learn how the project works under the hood.

@claytonlin1110
Contributor Author

@PastelStorm make test correct?

@claytonlin1110
Contributor Author

@PastelStorm Please let me know the test workflow.
Even on the main branch, make test fails for me.

@claytonlin1110
Contributor Author

@PastelStorm Could you please review again?

claytonlin1110 force-pushed the feat/language-detection-custom-fallback branch from b2ba912 to dc445a9 (February 23, 2026 03:16)
@claytonlin1110
Contributor Author

@PastelStorm Everything passes now except for one issue with test_ingest_src.
Any suggestions for fixing it?

@PastelStorm
Contributor

PastelStorm commented Feb 23, 2026

> @PastelStorm Everything passes now except for one issue with test_ingest_src. Any suggestions for fixing it?

You needed to regenerate the fixtures, but I think my colleagues already did that in #4256. I started a new CI workflow; let's see if it passes.

There are differences from the previously checked-in structured outputs.

If these differences are acceptable, overwrite by the fixtures by setting the env var:

  export OVERWRITE_FIXTURES=true

and then rerun this script.

NOTE: You'll likely just want to run scripts/ingest-test-fixtures-update.sh on x86_64 hardware
to update fixtures for CI.

https://github.com/Unstructured-IO/unstructured/actions/runs/22235316041/job/64373485696#step:6:9156

@claytonlin1110
Contributor Author

@PastelStorm Please review again

@claytonlin1110
Contributor Author

@PastelStorm FYI, in the UDHR_first_article_all.txt.json file, the languages, modes and path values are changed automatically when I run the command. I ignored the path values since they just reflect my local project path, so I pushed only the languages and modes changes. Let me know if that's okay.

@claytonlin1110
Contributor Author

@PastelStorm It seems the mode values shouldn't be changed. I just pushed a fix; please re-run CI. Thanks.

@claytonlin1110
Contributor Author

@PastelStorm All CI passed, thanks for your support.

@PastelStorm PastelStorm added this pull request to the merge queue Feb 23, 2026
Merged via the queue into Unstructured-IO:main with commit afbda95 Feb 23, 2026
51 checks passed

Development

Successfully merging this pull request may close these issues.

feat/custom fallback for language detection

2 participants