[data] fix: pad chat tensors and loss_mask in pre_pad_dataset (#2610) #3665
Open
lonexreb wants to merge 2 commits into NVIDIA-NeMo:main from
Conversation
…-NeMo#2610)

`tokenize_dataset` raised when packing chat datasets with `pad_seq_to_mult > 1`:

1. `_chat_preprocess` (and the legacy `_preprocess`) return `torch.LongTensor` / `torch.BoolTensor`, but `pre_pad_dataset` extended values via `val + [pad_id] * N`, which raises `TypeError` for tensors.
2. Only `input_ids` and `context_ids` were padded. `loss_mask` was left at its original length, so when `create_hist` grouped two items by their padded `input_ids` length, `fill_packing_strategy` later hit `ValueError: setting an array element with a sequence ... inhomogeneous shape` on `np.array([x["loss_mask"] for x in per_seq_data])`.

Fix:

- Normalize each value to a Python list via `tolist()` before concatenation, so tensor and list inputs share the same padding path.
- Pad `input_ids` and `context_ids` with `pad_id`; pad `loss_mask` with `False` so padded positions never contribute to the training loss.
- Skip keys missing from `data` (preserves the non-chat path, which has no `loss_mask`).

Tests:

- `test_packed_chat_padding.py` exercises the full bug surface: tensor inputs, bin-mate alignment of `loss_mask`, fill-value semantics, the non-chat list path, the `pad_seq_to_mult=1` bypass, and an end-to-end `create_hist` → `fill_packing_strategy` round-trip that previously raised.

Signed-off-by: lonexreb <[email protected]>
Contributor

/ok to test 8572aa6
Summary
Fixes #2610 — `tokenize_dataset` raises when packing chat datasets with `pad_seq_to_mult > 1`. Two distinct bugs in `pre_pad_dataset` (inside `src/megatron/bridge/data/datasets/packed_sequence.py`) interacted to break this path:

1. `_chat_preprocess` (and the legacy `_preprocess`) return `torch.LongTensor` / `torch.BoolTensor`, but the padding logic concatenated them with a Python list (`val + [pad_id] * N`), which raises `TypeError` for tensors.
2. Missing `loss_mask` padding: only `input_ids` and `context_ids` were padded. `loss_mask` kept its original length, so when `create_hist` grouped two items by their padded `input_ids` length, `fill_packing_strategy` later hit `ValueError: setting an array element with a sequence ... inhomogeneous shape` on `np.array([x['loss_mask'] for x in per_seq_data])`.

The failure surfaces only when `chat=True` and `pad_seq_to_mult > 1` — `pad_seq_to_mult == 1` and the non-chat `GPTSFTDataset` path are unaffected.
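For concreteness, the second failure can be reproduced in isolation. The sketch below uses toy `loss_mask` values (not taken from the PR or the test file) to show how two bin-mates with mismatched lengths break the array construction inside `fill_packing_strategy`:

```python
import numpy as np

# Two items grouped into the same bin because their *padded* input_ids lengths
# match, while their loss_mask values were never padded (toy values).
per_seq_data = [
    {"loss_mask": [True, True, True]},        # original length 3
    {"loss_mask": [True, True, True, True]},  # original length 4
]

# On NumPy >= 1.24 this raises:
#   ValueError: setting an array element with a sequence ... inhomogeneous shape
np.array([x["loss_mask"] for x in per_seq_data])
```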
Fix

All changes are in `src/megatron/bridge/data/datasets/packed_sequence.py:121-167`:

- Normalize each value to a Python list via `tolist()` before concatenation, so tensor and list inputs share the same padding path.
- Pad `input_ids` and `context_ids` with `pad_id`; pad `loss_mask` with `False`, so padded positions never contribute to the training loss.
- Drive the padding from a `_PAD_VALUES` map keyed `{input_ids, context_ids, loss_mask}` and skip keys missing from `data` — preserves the non-chat path, which has no `loss_mask`.

The diff is +44 / -13 in the source file; the public function signature is unchanged.
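A minimal sketch of that approach (not the actual diff: the helper's real length arithmetic, key handling, and error paths may differ, and `_pad_item_sketch` is a hypothetical name):

```python
import torch


def _pad_item_sketch(data: dict, max_length_to_pad: int, pad_id: int) -> None:
    """Hypothetical illustration of the padding strategy described above."""
    # Per-key fill values: token ids are padded with pad_id, loss_mask with False
    # so padded positions never contribute to the training loss.
    pad_values = {"input_ids": pad_id, "context_ids": pad_id, "loss_mask": False}
    for key, fill in pad_values.items():
        if key not in data:
            # Non-chat items carry no loss_mask; skipping missing keys keeps that path intact.
            continue
        val = data[key]
        if isinstance(val, torch.Tensor):
            # Normalize to a plain list so tensors and lists share one padding path.
            val = val.tolist()
        if len(val) < max_length_to_pad:
            val = val + [fill] * (max_length_to_pad - len(val))
        data[key] = val
```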
Tests
`tests/unit_tests/data/datasets/test_packed_chat_padding.py` (new, 233 LoC) covers the full bug surface:

- `test_torch_tensor_inputs_do_not_raise`: tensor inputs no longer hit `TypeError` during padding
- `test_loss_mask_is_padded_to_input_ids_length` (a sketch of this check follows the list)
- `test_loss_mask_padding_uses_false`: padded positions are `False` (no spurious loss on pad)
- `test_context_ids_padded_with_pad_id`: `context_ids` continues to pad with `pad_id`, like `input_ids`
- `test_non_chat_list_inputs_still_padded`: the non-chat list path (no `loss_mask`) still works; we don't invent `loss_mask`
- `test_pad_seq_to_mult_one_skips_padding`: the `pad_seq_to_mult=1` bypass is preserved (tensors pass through unchanged)
- `test_create_hist_then_fill_packing_strategy_no_error`: end-to-end `create_hist` → `fill_packing_strategy` round-trip that previously raised `ValueError: inhomogeneous shape`
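The alignment check might look roughly like the following; the `pre_pad_dataset` keyword names, the in-place mutation, and the toy tensors are assumptions for illustration, not the test file's actual contents:

```python
import torch

from megatron.bridge.data.datasets.packed_sequence import pre_pad_dataset

# Toy chat-style item: tensors, as _chat_preprocess would produce (values are made up).
data = {
    "input_ids": torch.LongTensor([10, 11, 12, 13, 14]),
    "context_ids": torch.LongTensor([10, 11]),
    "loss_mask": torch.BoolTensor([False, False, True, True, True]),
}

# Assumed call shape, mirroring the upstream helper; the real signature may differ.
pre_pad_dataset(data, max_seq_length=64, max_length_to_pad=8, pad_id=0)

# After the fix, loss_mask tracks input_ids and its padded tail is all False.
assert len(data["loss_mask"]) == len(data["input_ids"])
assert not any(data["loss_mask"][5:])
```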
Implementation notes for reviewers:

- The tests use `unittest.mock` to stub `create_sft_dataset`, so they run on CPU and don't need a real tokenizer or HF Hub access.
- Sequences are short (`max_seq_length=64`), keeping the suite well under the L0 budget.
- The end-to-end case pads two items to `max_length_to_pad=8` so they land in the same `create_hist` bucket. Before the fix, this is the exact configuration that triggered the inhomogeneous-shape error reported in "tokenize_dataset crashes with TypeError when chat=True and pad_seq_to_mult > 1" (#2610).
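A rough sketch of the stubbing pattern described in the first note; the patch target, the stub's return value, and the item shapes are assumptions for illustration, not the test file's actual code:

```python
from unittest import mock

import torch

# Hypothetical items a stubbed dataset could yield; keys mirror the chat path.
fake_items = [
    {
        "input_ids": torch.LongTensor([1, 2, 3, 4, 5]),
        "context_ids": torch.LongTensor([1, 2]),
        "loss_mask": torch.BoolTensor([False, False, True, True, True]),
    }
]

# Assumed patch target: create_sft_dataset as imported by packed_sequence.py.
with mock.patch(
    "megatron.bridge.data.datasets.packed_sequence.create_sft_dataset",
    return_value=fake_items,
):
    # Exercise the packing path here (e.g. tokenize_dataset) on CPU, with no
    # real tokenizer and no HF Hub access; the stub supplies deterministic items.
    ...
```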
Test plan

- `ruff check` and `ruff format --check` pass on both files.
- `tests/unit_tests/data/datasets/test_packed_chat_padding.py` is picked up automatically (test discovery is path-based).

Risk
Zero production risk for users who don't combine `chat=True` with `pad_seq_to_mult > 1` — those code paths are untouched. For the affected combination, behavior changes from "raises `TypeError`/`ValueError`" to "produces a correctly padded packed dataset" — no silent behavior change.

Notes for reviewers
- A narrower fix could have kept the padding limited to the existing `{input_ids, context_ids}` set, but that would leave Bug 2 unfixed: bin-mates with mismatched `loss_mask` lengths still crash inside `fill_packing_strategy`. Both fixes are required for the reported repro.
- `loss_mask` pad value: `False` is consistent with how `_collate_item` treats the padding region (`pad_id=0`) inside `GPTSFTChatDataset.collate_fn` (`src/megatron/bridge/data/datasets/sft.py:1261`), and with the non-chat `_build_loss_mask` path, which uses `0.0`/`1.0` floats.