[data, test] feat: add unit tests for HuggingFace dataset processors#3680
Open
lonexreb wants to merge 1 commit intoNVIDIA-NeMo:mainfrom
Open
[data, test] feat: add unit tests for HuggingFace dataset processors#3680lonexreb wants to merge 1 commit intoNVIDIA-NeMo:mainfrom
lonexreb wants to merge 1 commit intoNVIDIA-NeMo:mainfrom
Conversation
`src/megatron/bridge/data/hf_processors/` exports three pure-function
dataset processors (squad, gsm8k, openmathinstruct2) used by the
`default_*_config` helpers in finetune_utils.py. Each one is a
dict-in / dict-out function with no I/O and no tokenizer dependency.
They had **functional tests** under
`tests/functional_tests/test_groups/data/hf_processors/` but **zero
unit tests** — meaning every regression in the input/output formatting
required a GPU container CI slot to catch.
This PR adds unit-test coverage so processor-format regressions get
caught at L0-unit-test time. 22 tests across 4 classes:
`TestProcessSquadExample` (5):
- basic example produces documented `Context: ... Question: ...
Answer:` formatting; output is the FIRST answer in the answers
list; original_answers preserves all alternatives
- single-answer example produces a valid output
- input strictly ends with bare `Answer:`
- tokenizer arg is a no-op (None and a sentinel produce identical
output) — locks in the documented contract
- missing required field surfaces a clear KeyError
`TestExtractFinalAnswer` (5):
- extracts value after `####`, strips whitespace, returns full
answer stripped when no delimiter, returns empty string when
`####` is followed by whitespace, uses LAST split when `####`
appears multiple times
`TestProcessGsm8kExample` (4):
- basic example: `Question: ... Answer:` formatting; output is the
full chain-of-thought; original_answers contains ONLY the
extracted final numerical answer
- no-`####` answer flows through `_extract_final_answer` correctly
- input strictly ends with `Answer:`
- tokenizer arg is a no-op
`TestProcessOpenMathInstruct2Example` (5):
- basic example: `Problem: ... Solution:` formatting; output is the
generated solution; original_answers wraps expected_answer in a
one-element list
- input strictly ends with `Solution:`
- expected_answer is preserved verbatim (no stripping)
- tokenizer arg is a no-op
- missing required field surfaces KeyError
`TestProcessorOutputContract` (3, parametrized):
- cross-processor invariant: every processor returns a dict with
`input` (str), `output` (str), and `original_answers` (list[str]
with at least one element)
Tests-only — no production code changes. Locks in the formatting
contract every recipe that uses these processors implicitly depends on.
Signed-off-by: lonexreb <[email protected]>
cuichenx
reviewed
May 6, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
src/megatron/bridge/data/hf_processors/exports three pure-function dataset processors used bydefault_squad_config,default_gsm8k_config, anddefault_openmathinstruct2_configinfinetune_utils.py:process_squad_exampleprocess_gsm8k_example(+ private_extract_final_answerhelper)process_openmathinstruct2_exampleEach is a dict-in / dict-out function with no I/O and no tokenizer dependency.
They had functional tests under
tests/functional_tests/test_groups/data/hf_processors/but zero unit tests — meaning every regression in input/output formatting required a GPU container CI slot to catch.This PR adds 22 fast unit tests across 4 classes (+287 LoC). Tests-only — no production changes.
What's covered
TestProcessSquadExampleContext: ... Question: ... Answer:formatting, first-answer-as-output rule,original_answerspreservation, single-answer case, bareAnswer:suffix, tokenizer-arg-is-no-op contract, missing-field KeyErrorTestExtractFinalAnswer####, whitespace stripping, no-delimiter fallback, empty-after-delimiter case, multiple-delimiter LAST-split ruleTestProcessGsm8kExampleQuestion: ... Answer:formatting, full-answer output, extracted-final-answer inoriginal_answers, no-####flow, tokenizer-no-opTestProcessOpenMathInstruct2ExampleProblem: ... Solution:formatting, generated-solution as output,expected_answerverbatim preservation, missing-field KeyError, tokenizer-no-opTestProcessorOutputContract{input: str, output: str, original_answers: list[str] with ≥1 element}Why this matters
Recipes built via
_sft_common/_peft_commonrely on these processors implicitly through thedefault_*_confighelpers. A formatting regression (e.g. dropping the trailingAnswer:, breaking_extract_final_answer's####parsing) silently changes every fine-tuning recipe's input templates without the recipe-level tests noticing.Test plan
python3 -m astparse cleanruff checkcleanruff formatappliedcicd-unit-tests-corepicks up the new module undertests/unit_tests/data/Risk
Zero — tests only.
Self-verification (lesson from #3648)
Before writing, ran
git ls-files | grep hf_processors. Confirmed: onlytests/functional_tests/test_groups/data/hf_processors/exists; notests/unit_tests/data/hf_processors/directory present on main. This PR creates that directory.