
[data, test] feat: add unit tests for HuggingFace dataset processors#3680

Open
lonexreb wants to merge 1 commit into NVIDIA-NeMo:main from lonexreb:training/test-hf-processors

Conversation


@lonexreb lonexreb commented May 5, 2026

Summary

`src/megatron/bridge/data/hf_processors/` exports three pure-function dataset processors used by `default_squad_config`, `default_gsm8k_config`, and `default_openmathinstruct2_config` in `finetune_utils.py`:

  • `process_squad_example`
  • `process_gsm8k_example` (plus the private `_extract_final_answer` helper)
  • `process_openmathinstruct2_example`

Each is a dict-in / dict-out function with no I/O and no tokenizer dependency.
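As an illustration of that dict-in / dict-out contract, a minimal SQuAD-style processor could look like the sketch below. The field names (`context`, `question`, nested `answers["text"]`) and the exact template are assumptions based on the PR description and the Hugging Face SQuAD schema, not the actual module source:

```python
def process_squad_example(example: dict, tokenizer=None) -> dict:
    # Hypothetical sketch, not the real module code. The tokenizer argument
    # is accepted but deliberately unused (the documented no-op contract).
    answers = example["answers"]["text"]  # a missing field raising KeyError is intended
    return {
        "input": f"Context: {example['context']} Question: {example['question']} Answer:",
        "output": answers[0],               # first answer becomes the training target
        "original_answers": list(answers),  # all alternatives preserved
    }

example = {
    "context": "NeMo is a framework.",
    "question": "What is NeMo?",
    "answers": {"text": ["a framework", "framework"]},
}
result = process_squad_example(example)
```

Note that the input strictly ends with the bare `Answer:` suffix, which is exactly what the tests below pin down.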

They had functional tests under `tests/functional_tests/test_groups/data/hf_processors/` but zero unit tests, meaning every regression in input/output formatting required a GPU container CI slot to catch.

This PR adds 22 fast unit tests across 4 classes (+287 LoC). Tests-only — no production changes.

What's covered

| Class | Tests | Coverage |
| --- | --- | --- |
| `TestProcessSquadExample` | 5 | `Context: ... Question: ... Answer:` formatting; first-answer-as-output rule; `original_answers` preservation; single-answer case; bare `Answer:` suffix; tokenizer-arg-is-no-op contract; missing-field `KeyError` |
| `TestExtractFinalAnswer` | 5 | extraction after `####`; whitespace stripping; no-delimiter fallback; empty-after-delimiter case; multiple-delimiter last-split rule |
| `TestProcessGsm8kExample` | 4 | `Question: ... Answer:` formatting; full-answer output; extracted final answer in `original_answers`; no-`####` flow; tokenizer no-op |
| `TestProcessOpenMathInstruct2Example` | 5 | `Problem: ... Solution:` formatting; generated solution as output; `expected_answer` verbatim preservation; missing-field `KeyError`; tokenizer no-op |
| `TestProcessorOutputContract` | 3 (parametrized) | cross-processor invariant: every processor returns `{input: str, output: str, original_answers: list[str]}` with at least one element |
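The cross-processor invariant in the last row can be sketched as a plain check. The toy dicts below stand in for real processor outputs and are not taken from the module:

```python
def check_contract(processed: dict) -> None:
    # Invariant every processor output must satisfy, per the PR description.
    assert isinstance(processed["input"], str)
    assert isinstance(processed["output"], str)
    answers = processed["original_answers"]
    assert isinstance(answers, list) and len(answers) >= 1
    assert all(isinstance(a, str) for a in answers)

# Toy stand-ins illustrating the shared return shape of all three processors
samples = [
    {"input": "Question: 2+2? Answer:", "output": "4", "original_answers": ["4"]},
    {"input": "Problem: ... Solution:", "output": "x = 1", "original_answers": ["1"]},
]
for s in samples:
    check_contract(s)
```

In the actual test class the same idea would presumably be expressed with `@pytest.mark.parametrize` over the three processor functions.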

Why this matters

Recipes built via `_sft_common` / `_peft_common` rely on these processors implicitly through the `default_*_config` helpers. A formatting regression (e.g. dropping the trailing `Answer:` or breaking `_extract_final_answer`'s `####` parsing) silently changes every fine-tuning recipe's input templates without the recipe-level tests noticing.

Test plan

  • `python3 -m ast` parse: clean
  • `ruff check`: clean
  • `ruff format`: applied
  • CI: `cicd-unit-tests-core` picks up the new module under `tests/unit_tests/data/`

Risk

Zero — tests only.

Self-verification (lesson from #3648)

Before writing, ran `git ls-files | grep hf_processors`. Confirmed: only `tests/functional_tests/test_groups/data/hf_processors/` exists; no `tests/unit_tests/data/hf_processors/` directory is present on `main`. This PR creates that directory.

`src/megatron/bridge/data/hf_processors/` exports three pure-function
dataset processors (squad, gsm8k, openmathinstruct2) used by the
`default_*_config` helpers in finetune_utils.py. Each one is a
dict-in / dict-out function with no I/O and no tokenizer dependency.

They had **functional tests** under
`tests/functional_tests/test_groups/data/hf_processors/` but **zero
unit tests** — meaning every regression in the input/output formatting
required a GPU container CI slot to catch.

This PR adds unit-test coverage so processor-format regressions get
caught at L0-unit-test time. 22 tests across 4 classes:

`TestProcessSquadExample` (5):
  - basic example produces documented `Context: ... Question: ...
    Answer:` formatting; output is the FIRST answer in the answers
    list; original_answers preserves all alternatives
  - single-answer example produces a valid output
  - input strictly ends with bare `Answer:`
  - tokenizer arg is a no-op (None and a sentinel produce identical
    output) — locks in the documented contract
  - missing required field surfaces a clear KeyError

`TestExtractFinalAnswer` (5):
  - extracts value after `####`, strips whitespace, returns full
    answer stripped when no delimiter, returns empty string when
    `####` is followed by whitespace, uses LAST split when `####`
    appears multiple times

`TestProcessGsm8kExample` (4):
  - basic example: `Question: ... Answer:` formatting; output is the
    full chain-of-thought; original_answers contains ONLY the
    extracted final numerical answer
  - no-`####` answer flows through `_extract_final_answer` correctly
  - input strictly ends with `Answer:`
  - tokenizer arg is a no-op
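A hedged sketch of the GSM8K flow those four tests describe, assuming the Hugging Face GSM8K field names (`question`, `answer`) and inlining a `####` extractor consistent with the `TestExtractFinalAnswer` bullets; the real module may differ:

```python
def _extract_final_answer(answer: str) -> str:
    # Inlined here so the sketch is self-contained; see _extract_final_answer above.
    if "####" not in answer:
        return answer.strip()
    return answer.rsplit("####", 1)[1].strip()

def process_gsm8k_example(example: dict, tokenizer=None) -> dict:
    # tokenizer is a documented no-op, kept only for signature compatibility
    return {
        "input": f"Question: {example['question']} Answer:",
        "output": example["answer"],  # full chain-of-thought kept as the target
        # original_answers holds ONLY the extracted final numerical answer
        "original_answers": [_extract_final_answer(example["answer"])],
    }

result = process_gsm8k_example(
    {"question": "2 + 2?", "answer": "2 plus 2 is 4. #### 4"}
)
```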

`TestProcessOpenMathInstruct2Example` (5):
  - basic example: `Problem: ... Solution:` formatting; output is the
    generated solution; original_answers wraps expected_answer in a
    one-element list
  - input strictly ends with `Solution:`
  - expected_answer is preserved verbatim (no stripping)
  - tokenizer arg is a no-op
  - missing required field surfaces KeyError

`TestProcessorOutputContract` (3, parametrized):
  - cross-processor invariant: every processor returns a dict with
    `input` (str), `output` (str), and `original_answers` (list[str]
    with at least one element)

Tests-only — no production code changes. Locks in the formatting
contract every recipe that uses these processors implicitly depends on.

Signed-off-by: lonexreb <[email protected]>

copy-pr-bot Bot commented May 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


@cuichenx cuichenx left a comment

These tests already exist in `tests/functional_tests/`, but they actually don't use GPUs. It's better to move them instead of creating duplicate tests.

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-customer Waiting on the original author to respond label May 6, 2026

Labels

community-request waiting-on-customer Waiting on the original author to respond

3 participants