
[data] fix: pad chat tensors and loss_mask in pre_pad_dataset (#2610) #3665

Open
lonexreb wants to merge 2 commits into NVIDIA-NeMo:main from lonexreb:fix/packed-chat-pad-seq-mult-2610

Conversation

lonexreb (Contributor) commented May 4, 2026

Summary

Fixes #2610: tokenize_dataset raises when packing chat datasets with pad_seq_to_mult > 1.

Two distinct bugs in pre_pad_dataset (inside src/megatron/bridge/data/datasets/packed_sequence.py) interacted to break this path:

  1. Tensor / list mismatch. _chat_preprocess (and the legacy _preprocess) return torch.LongTensor / torch.BoolTensor, but the padding logic concatenated them with a Python list (val + [pad_id] * N), which raises TypeError for tensors.
  2. Missing loss_mask padding. Only input_ids and context_ids were padded. loss_mask kept its original length, so when create_hist grouped two items by their padded input_ids length, fill_packing_strategy later hit ValueError: setting an array element with a sequence ... inhomogeneous shape on np.array([x['loss_mask'] for x in per_seq_data]).

The failure surfaces only when chat=True and pad_seq_to_mult > 1; pad_seq_to_mult == 1 and the non-chat GPTSFTDataset path are unaffected.
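A minimal, self-contained repro of both failure modes (toy values, not the real _chat_preprocess output; exact exception messages depend on the installed torch/numpy versions):

```python
import numpy as np
import torch

pad_id, extra = 0, 3

# Bug 1: the old padding logic did `val + [pad_id] * N`, which concatenates
# for Python lists but not for tensors (#2610 reports a TypeError here).
input_ids = torch.LongTensor([5, 6, 7, 8])
try:
    input_ids + [pad_id] * extra
except (TypeError, RuntimeError) as exc:
    print("bug 1:", exc)

# Bug 2: input_ids were padded but loss_mask was not, so bin-mates end up
# with ragged loss_mask lengths and np.array() refuses to stack them.
per_seq_data = [{"loss_mask": [1, 1, 1]}, {"loss_mask": [1, 1, 1, 1, 1]}]
try:
    np.array([x["loss_mask"] for x in per_seq_data])
except ValueError as exc:
    print("bug 2:", exc)
```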

Fix

src/megatron/bridge/data/datasets/packed_sequence.py:121-167

  • Normalize each value to a Python list via tolist() before concatenation, so tensor and list inputs share the same padding path.
  • Pad input_ids and context_ids with pad_id; pad loss_mask with False so padded positions never contribute to the training loss.
  • Iterate over a fixed _PAD_VALUES map keyed by {input_ids, context_ids, loss_mask} and skip keys missing from data; this preserves the non-chat path, which has no loss_mask (a condensed sketch of this loop follows the diff note below).

The diff is +44 / -13 in the source file; the public function signature is unchanged.
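For orientation, a condensed sketch of the padding loop described above (illustrative only; the real implementation in packed_sequence.py is more involved, and the pad-length bookkeeping here is simplified):

```python
import torch

# None means "pad with pad_id"; loss_mask pads with False so padded
# positions never contribute to the training loss.
_PAD_VALUES = {"input_ids": None, "context_ids": None, "loss_mask": False}


def pre_pad_item(data: dict, pad_id: int, pad_seq_to_mult: int) -> None:
    """Pad one example in place so len(input_ids) is a multiple of pad_seq_to_mult."""
    if pad_seq_to_mult <= 1:
        return  # bypass: inputs pass through unchanged

    pad_len = (-len(data["input_ids"])) % pad_seq_to_mult
    if pad_len == 0:
        return

    for key, fill in _PAD_VALUES.items():
        if key not in data:  # non-chat path has no loss_mask
            continue
        val = data[key]
        if torch.is_tensor(val):
            # Normalize tensors to Python lists so tensor and list inputs
            # share the same concatenation-based padding path.
            val = val.tolist()
        data[key] = val + [(pad_id if fill is None else fill)] * pad_len
```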

Tests

tests/unit_tests/data/datasets/test_packed_chat_padding.py (new, 233 LoC) covers the full bug surface:

Test | What it pins
test_torch_tensor_inputs_do_not_raise | Bug 1: tensor inputs no longer raise TypeError during padding
test_loss_mask_is_padded_to_input_ids_length | Bug 2: bin-mates with different original lengths now share equal padded length
test_loss_mask_padding_uses_false | Padded loss_mask positions are False (no spurious loss on pad)
test_context_ids_padded_with_pad_id | context_ids continues to pad with pad_id, like input_ids
test_non_chat_list_inputs_still_padded | Non-chat path (lists, no loss_mask) still works; we don't invent loss_mask
test_pad_seq_to_mult_one_skips_padding | pad_seq_to_mult=1 bypass is preserved (tensors pass through unchanged)
test_create_hist_then_fill_packing_strategy_no_error | End-to-end round-trip through create_hist → fill_packing_strategy that previously raised ValueError: inhomogeneous shape

Implementation notes for reviewers:

  • All tests use unittest.mock to stub create_sft_dataset, so they run on CPU and don't need a real tokenizer or HF Hub access.
  • Sequence lengths are tiny (3-8 tokens) and max_seq_length=64, keeping the suite well under the L0 budget.
  • The end-to-end test deliberately constructs two items of original length 3 and 5 — both round to max_length_to_pad=8 and land in the same create_hist bucket. Before the fix, this is the exact configuration that triggered the inhomogeneous-shape error reported in #2610 (tokenize_dataset crashes with TypeError when chat=True and pad_seq_to_mult > 1). An illustrative skeleton of the stubbing pattern follows this list.
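For reviewers who want the shape of the stubbing pattern without opening the test file, an illustrative skeleton follows (not the actual test; the mock.patch target, fixture contents, and the entry point being exercised are assumptions):

```python
from unittest import mock

import torch


def _fake_chat_items():
    # Two chat-style items of original length 3 and 5; per the note above,
    # both round to max_length_to_pad=8 and share a create_hist bucket.
    return [
        {
            "input_ids": torch.LongTensor([1, 2, 3]),
            "context_ids": torch.LongTensor([1]),
            "loss_mask": torch.BoolTensor([False, True, True]),
        },
        {
            "input_ids": torch.LongTensor([1, 2, 3, 4, 5]),
            "context_ids": torch.LongTensor([1, 2]),
            "loss_mask": torch.BoolTensor([False, False, True, True, True]),
        },
    ]


def test_torch_tensor_inputs_do_not_raise():
    # Patch target is an assumption about where create_sft_dataset is imported.
    with mock.patch(
        "megatron.bridge.data.datasets.packed_sequence.create_sft_dataset",
        return_value=_fake_chat_items(),
    ):
        # Exercise the padding/packing entry point here (e.g. pre_pad_dataset
        # over each stubbed item) and assert that every loss_mask now has the
        # same length as its padded input_ids.
        ...
```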

Test plan

  • ruff check and ruff format --check pass on both files
  • Standalone simulation of the padding logic verified all invariants pre-CI (cannot run pytest locally on macOS without the Megatron-LM submodule + GPU deps; CI will validate)
  • CI L0 unit tests — runs tests/unit_tests/data/datasets/test_packed_chat_padding.py automatically (test discovery is path-based)

Risk

Zero production risk for users who don't combine chat=True with pad_seq_to_mult > 1 — those code paths are untouched. For the affected combination, behavior changes from "raises TypeError / ValueError" to "produces a correctly padded packed dataset" — no silent behavior change.

Notes for reviewers

  • I considered limiting the change to converting tensors to lists in the existing {input_ids, context_ids} set, but that would leave Bug 2 unfixed: bin-mates with mismatched loss_mask lengths still crash inside fill_packing_strategy. Both fixes are required for the reported repro.
  • loss_mask pad value: False is consistent with how _collate_item treats the padding region (pad_id=0) inside GPTSFTChatDataset.collate_fn (src/megatron/bridge/data/datasets/sft.py:1261), and with the non-chat _build_loss_mask path which uses 0.0/1.0 floats.

…-NeMo#2610)

`tokenize_dataset` raised when packing chat datasets with `pad_seq_to_mult > 1`:

1. `_chat_preprocess` (and the legacy `_preprocess`) return `torch.LongTensor` /
   `torch.BoolTensor`, but `pre_pad_dataset` extended values via `val + [pad_id] * N`,
   which raises `TypeError` for tensors.
2. Only `input_ids` and `context_ids` were padded. `loss_mask` was left at its
   original length, so when `create_hist` grouped two items by their padded
   `input_ids` length, `fill_packing_strategy` later hit
   `ValueError: setting an array element with a sequence ... inhomogeneous shape`
   on `np.array([x["loss_mask"] for x in per_seq_data])`.

Fix:
- Normalize each value to a Python list via `tolist()` before concatenation, so
  tensor and list inputs share the same padding path.
- Pad `input_ids` and `context_ids` with `pad_id`; pad `loss_mask` with `False`
  so padded positions never contribute to the training loss.
- Skip keys missing from `data` (preserves the non-chat path which has no
  `loss_mask`).

Tests:
- `test_packed_chat_padding.py` exercises the full bug surface: tensor inputs,
  bin-mate alignment of `loss_mask`, fill value semantics, non-chat list path,
  the `pad_seq_to_mult=1` bypass, and an end-to-end `create_hist` →
  `fill_packing_strategy` round-trip that previously raised.

Signed-off-by: lonexreb <[email protected]>
copy-pr-bot (Bot) commented May 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

cuichenx (Contributor) commented May 6, 2026

/ok to test 8572aa6

cuichenx added the needs-review label (PR is ready for code review and waiting on a reviewer) on May 6, 2026

Labels

community-request, needs-review (PR is ready for code review and waiting on a reviewer)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

tokenize_dataset crashes with TypeError when chat=True and pad_seq_to_mult > 1

3 participants