feat: custom fallback for language detection#4238
Conversation
|
@badGarnet Would you please review this since this is my first contribution here. |
|
@badGarnet Sorry for pinging you since I'm a new contributor... Please review |
|
@PastelStorm I'm a new contributor here i don't know who i need to ping to get this PR reviewed. |
Would you mind making a feature branch instead of forking the repo please? Then I will review. Thank you. |
|
@PastelStorm Forking is the normal workflow on public git repo to contribute. |
1d0c2bc to
8fcbf39
Compare
|
@PastelStorm Do you want to set the target branch "feature" rather than "main"? |
|
@PastelStorm Please review as I'm an external contributor. |
I apologize, you're correct here. I'll review your PR in a moment. Thank you for the contribution :) |
PastelStorm
left a comment
There was a problem hiding this comment.
There is no integration test verifying that language_fallback flows correctly through the full partition() call chain. The parameter passes through three layers of **kwargs unpacking (partition() -> partition_md() -> partition_html() -> @apply_metadata -> apply_lang_metadata() -> detect_languages()). If any intermediate function changes its signature to capture this kwarg explicitly, the feature silently breaks. An integration test like partition(filename=..., language_fallback=lambda t: None) asserting element.metadata.languages is None would guard against this.
| if language_fallback is not None: | ||
| result = language_fallback(text) | ||
| return result | ||
| logger.debug(f'short text: "{text}". Defaulting to English.') |
There was a problem hiding this comment.
Two issues here:
-
Unnecessary temp variable.
result = language_fallback(text); return resultcan be simplified toreturn language_fallback(text). -
No validation of the fallback return value. The normal detection path validates language codes through
_get_iso639_language_object(), but the fallback bypasses all validation. A buggy callback returning["not_a_language"],[42], or""would produce invalid metadata that passes silently. Consider at minimum a type/value check, or document explicitly that the caller is responsible for returning valid ISO 639-3 codes.
| """Detect language and apply it to metadata.languages for each element in `elements`. | ||
| If languages is None, default to auto detection. | ||
| If languages is and empty string, skip.""" | ||
| If languages is and empty string, skip. |
There was a problem hiding this comment.
typo: and should be an
|
|
||
| partitioning_kwargs = copy.deepcopy(kwargs) | ||
| partitioning_kwargs["detect_language_per_element"] = detect_language_per_element | ||
| partitioning_kwargs["language_fallback"] = language_fallback |
There was a problem hiding this comment.
language_fallback is added to partitioning_kwargs (used for "all other file types"), but it is not passed to the PDF/image code paths (lines 215-253). This means partition(filename="doc.pdf", language_fallback=my_fn) silently ignores the callback. If intentional, the docstring for language_fallback (lines 102-105) should mention this limitation.
| Optional callable for short text (e.g. when detection defaults to English). | ||
| Called with the text; return a list of ISO 639-3 codes or None to leave | ||
| language unspecified. | ||
| pdf_infer_table_structure |
There was a problem hiding this comment.
The language_fallback docstring is indented as a sub-parameter of languages, but it's actually an independent top-level parameter. This is a pre-existing style issue (from detect_language_per_element) that this PR perpetuates.
| url: str | None = None, | ||
| metadata_filename: str | None = None, | ||
| metadata_last_modified: str | None = None, | ||
| languages: Optional[list[str]] = None, |
There was a problem hiding this comment.
Inconsistent parameter exposure. languages is surfaced explicitly, but detect_language_per_element and language_fallback (which also flow through **kwargs to partition_html -> @apply_metadata) remain implicit. If the goal is to make language-related parameters discoverable, they should be treated consistently.
| def test_detect_languages_short_text_fallback_returns_custom(): | ||
| """Short ASCII text with language_fallback returns custom language.""" | ||
| result = detect_languages( | ||
| text="Bonjour monde.", | ||
| language_fallback=lambda t: ["fra"], | ||
| ) | ||
| assert result == ["fra"] |
There was a problem hiding this comment.
Slightly misleading test. "Bonjour monde." is pure ASCII and has < 5 words, so the fallback is invoked purely based on length/charset, not because the text is French. The hardcoded ["fra"] return is what's being asserted. The test would behave identically with text="Hello world.". Consider using neutral text or adding a clarifying comment.
267533b to
9da0547
Compare
|
@PastelStorm Fixed everything. |
|
@claytonlin1110 you need to bump the project version and update the changelog file. |
adecfb8 to
7507698
Compare
Just added and pushed |
1ce845c to
cac7c0f
Compare
|
@PastelStorm Feel free to review |
|
@PastelStorm Would you help me to fix lint issue? Would be great if you can guide where to see the lint error and fix it. |
Take a look inside Makefile, you will find all sorts of commands there that will help you debug your code locally. For example to run a lint check you can execute either |
|
@PastelStorm Done |
Running the CI, hold on :) |
|
@claytonlin1110 you might want to run the tests locally, so we have less of this back-and-forth. These tests will fail locally too for example: |
By make test? |
|
@PastelStorm would you help fixing test? |
I would encourage you to do the research and fix it yourself. This way you will also learn how the project works under the hood. |
|
@PastelStorm make test correct? |
|
@PastelStorm let me know the test workflow... |
|
@PastelStorm Could you please review again? |
b2ba912 to
dc445a9
Compare
|
@PastelStorm Everything is passed, now just one issue with test_ingest_src. |
You needed to regenerate the fixtures but I think my colleagues already did it in #4256 . I started a new CI workflow, let's see if it passes. |
|
@claytonlin1110 same issue, you need to regenerate the fixtures first, then you can run the tests: |
|
@PastelStorm Please review again |
|
@PastelStorm FYI, in UDHR_first_article_all.txt.json file, languages, modes and path values are changed automatically when I run the command. But i ignored path values since it's just replacing my local project path, so i just pushed languages and modes changes only. Let me know if it's okay. |
|
@PastelStorm Seems like mode values shouldnt be changed. I just pushed, please re-run CI. Thanks. |
|
@PastelStorm All CI passed, thanks for your support. |
Closes #4091
Implements custom fallback for language detection so short text is not forced to English and callers can control or disable detection.
Changes:
Optional callable used when text is short (<5 words) and ASCII. It receives the text and can return a list of ISO 639-3 codes or None to leave language unspecified. If not provided, short text still defaults to ["eng"] (backward compatible).
New parameter language_fallback; applied in the short-text path only.
New parameter language_fallback; passed through to all partitioners via the metadata decorator.
New parameter languages so callers can pass languages=[""] to disable language detection (aligned with other partitioners).
Usage: