
[data, test] feat: add unit tests for HuggingFace dataset processors#3680

Open
lonexreb wants to merge 1 commit into NVIDIA-NeMo:main from lonexreb:training/test-hf-processors

Conversation


@lonexreb lonexreb commented May 5, 2026

Summary

`src/megatron/bridge/data/hf_processors/` exports three pure-function dataset processors used by `default_squad_config`, `default_gsm8k_config`, and `default_openmathinstruct2_config` in `finetune_utils.py`:

  • `process_squad_example`
  • `process_gsm8k_example` (plus the private `_extract_final_answer` helper)
  • `process_openmathinstruct2_example`

Each is a dict-in / dict-out function with no I/O and no tokenizer dependency.
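As an illustration of that dict-in / dict-out contract, a minimal SQuAD-style processor could look like the sketch below. The field names (`context`, `question`, nested `answers["text"]`) and the exact template are assumptions based on the PR description and the Hugging Face SQuAD schema, not the actual module source:

```python
def process_squad_example(example: dict, tokenizer=None) -> dict:
    # Hypothetical sketch, not the real module code. The tokenizer argument
    # is accepted but deliberately unused (the documented no-op contract).
    answers = example["answers"]["text"]  # a missing field raising KeyError is intended
    return {
        "input": f"Context: {example['context']} Question: {example['question']} Answer:",
        "output": answers[0],               # first answer becomes the training target
        "original_answers": list(answers),  # all alternatives preserved
    }

example = {
    "context": "NeMo is a framework.",
    "question": "What is NeMo?",
    "answers": {"text": ["a framework", "framework"]},
}
result = process_squad_example(example)
```

Note that the input strictly ends with the bare `Answer:` suffix, which is exactly what the tests below pin down.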

They had functional tests under `tests/functional_tests/test_groups/data/hf_processors/` but zero unit tests, meaning every regression in input/output formatting required a GPU container CI slot to catch.

This PR adds 22 fast unit tests across 4 classes (+287 LoC). Tests-only — no production changes.

What's covered

| Class | Tests | Coverage |
| --- | --- | --- |
| `TestProcessSquadExample` | 5 | `Context: ... Question: ... Answer:` formatting; first-answer-as-output rule; `original_answers` preservation; single-answer case; bare `Answer:` suffix; tokenizer-arg-is-no-op contract; missing-field `KeyError` |
| `TestExtractFinalAnswer` | 5 | extraction after `####`; whitespace stripping; no-delimiter fallback; empty-after-delimiter case; multiple-delimiter last-split rule |
| `TestProcessGsm8kExample` | 4 | `Question: ... Answer:` formatting; full-answer output; extracted final answer in `original_answers`; no-`####` flow; tokenizer no-op |
| `TestProcessOpenMathInstruct2Example` | 5 | `Problem: ... Solution:` formatting; generated solution as output; `expected_answer` verbatim preservation; missing-field `KeyError`; tokenizer no-op |
| `TestProcessorOutputContract` | 3 (parametrized) | cross-processor invariant: every processor returns `{input: str, output: str, original_answers: list[str]}` with at least one element |
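The cross-processor invariant in the last row can be sketched as a plain check. The toy dicts below stand in for real processor outputs and are not taken from the module:

```python
def check_contract(processed: dict) -> None:
    # Invariant every processor output must satisfy, per the PR description.
    assert isinstance(processed["input"], str)
    assert isinstance(processed["output"], str)
    answers = processed["original_answers"]
    assert isinstance(answers, list) and len(answers) >= 1
    assert all(isinstance(a, str) for a in answers)

# Toy stand-ins illustrating the shared return shape of all three processors
samples = [
    {"input": "Question: 2+2? Answer:", "output": "4", "original_answers": ["4"]},
    {"input": "Problem: ... Solution:", "output": "x = 1", "original_answers": ["1"]},
]
for s in samples:
    check_contract(s)
```

In the actual test class the same idea would presumably be expressed with `@pytest.mark.parametrize` over the three processor functions.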

Why this matters

Recipes built via `_sft_common` / `_peft_common` rely on these processors implicitly through the `default_*_config` helpers. A formatting regression (e.g. dropping the trailing `Answer:` or breaking `_extract_final_answer`'s `####` parsing) silently changes every fine-tuning recipe's input templates without the recipe-level tests noticing.

Test plan

  • `python3 -m ast` parse: clean
  • `ruff check`: clean
  • `ruff format`: applied
  • CI: `cicd-unit-tests-core` picks up the new module under `tests/unit_tests/data/`

Risk

Zero — tests only.

Self-verification (lesson from #3648)

Before writing, ran `git ls-files | grep hf_processors`. Confirmed: only `tests/functional_tests/test_groups/data/hf_processors/` exists; no `tests/unit_tests/data/hf_processors/` directory is present on `main`. This PR creates that directory.

`src/megatron/bridge/data/hf_processors/` exports three pure-function
dataset processors (squad, gsm8k, openmathinstruct2) used by the
`default_*_config` helpers in finetune_utils.py. Each one is a
dict-in / dict-out function with no I/O and no tokenizer dependency.

They had **functional tests** under
`tests/functional_tests/test_groups/data/hf_processors/` but **zero
unit tests** — meaning every regression in the input/output formatting
required a GPU container CI slot to catch.

This PR adds unit-test coverage so processor-format regressions get
caught at L0-unit-test time. 22 tests across 4 classes:

`TestProcessSquadExample` (5):
  - basic example produces documented `Context: ... Question: ...
    Answer:` formatting; output is the FIRST answer in the answers
    list; original_answers preserves all alternatives
  - single-answer example produces a valid output
  - input strictly ends with bare `Answer:`
  - tokenizer arg is a no-op (None and a sentinel produce identical
    output) — locks in the documented contract
  - missing required field surfaces a clear KeyError

`TestExtractFinalAnswer` (5):
  - extracts value after `####`, strips whitespace, returns full
    answer stripped when no delimiter, returns empty string when
    `####` is followed by whitespace, uses LAST split when `####`
    appears multiple times

`TestProcessGsm8kExample` (4):
  - basic example: `Question: ... Answer:` formatting; output is the
    full chain-of-thought; original_answers contains ONLY the
    extracted final numerical answer
  - no-`####` answer flows through `_extract_final_answer` correctly
  - input strictly ends with `Answer:`
  - tokenizer arg is a no-op
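A hedged sketch of the GSM8K flow those four tests describe, assuming the Hugging Face GSM8K field names (`question`, `answer`) and inlining a `####` extractor consistent with the `TestExtractFinalAnswer` bullets; the real module may differ:

```python
def _extract_final_answer(answer: str) -> str:
    # Inlined here so the sketch is self-contained; see _extract_final_answer above.
    if "####" not in answer:
        return answer.strip()
    return answer.rsplit("####", 1)[1].strip()

def process_gsm8k_example(example: dict, tokenizer=None) -> dict:
    # tokenizer is a documented no-op, kept only for signature compatibility
    return {
        "input": f"Question: {example['question']} Answer:",
        "output": example["answer"],  # full chain-of-thought kept as the target
        # original_answers holds ONLY the extracted final numerical answer
        "original_answers": [_extract_final_answer(example["answer"])],
    }

result = process_gsm8k_example(
    {"question": "2 + 2?", "answer": "2 plus 2 is 4. #### 4"}
)
```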

`TestProcessOpenMathInstruct2Example` (5):
  - basic example: `Problem: ... Solution:` formatting; output is the
    generated solution; original_answers wraps expected_answer in a
    one-element list
  - input strictly ends with `Solution:`
  - expected_answer is preserved verbatim (no stripping)
  - tokenizer arg is a no-op
  - missing required field surfaces KeyError

`TestProcessorOutputContract` (3, parametrized):
  - cross-processor invariant: every processor returns a dict with
    `input` (str), `output` (str), and `original_answers` (list[str]
    with at least one element)

Tests-only — no production code changes. Locks in the formatting
contract every recipe that uses these processors implicitly depends on.

Signed-off-by: lonexreb <[email protected]>

copy-pr-bot Bot commented May 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


@cuichenx cuichenx left a comment

These tests already exist in `tests/functional_tests/`, but they actually don't use GPUs. It's better to move them instead of creating duplicate tests.

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-customer Waiting on the original author to respond label May 6, 2026

Labels

community-request waiting-on-customer Waiting on the original author to respond

3 participants