
[data] fix: pad chat tensors and loss_mask in pre_pad_dataset (#2610) #3665

Open
lonexreb wants to merge 2 commits into NVIDIA-NeMo:main from lonexreb:fix/packed-chat-pad-seq-mult-2610

Conversation

lonexreb (Contributor) commented May 4, 2026

Summary

Fixes #2610: tokenize_dataset raises when packing chat datasets with pad_seq_to_mult > 1.

Two distinct bugs in pre_pad_dataset (inside src/megatron/bridge/data/datasets/packed_sequence.py) interacted to break this path:

  1. Tensor / list mismatch. _chat_preprocess (and the legacy _preprocess) return torch.LongTensor / torch.BoolTensor, but the padding logic concatenated them with a Python list (val + [pad_id] * N), which raises TypeError for tensors.
  2. Missing loss_mask padding. Only input_ids and context_ids were padded. loss_mask kept its original length, so when create_hist grouped two items by their padded input_ids length, fill_packing_strategy later hit ValueError: setting an array element with a sequence ... inhomogeneous shape on np.array([x['loss_mask'] for x in per_seq_data]).

The failure surfaces only when chat=True and pad_seq_to_mult > 1; pad_seq_to_mult == 1 and the non-chat GPTSFTDataset path are unaffected.
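A minimal, self-contained repro of both failure modes (toy values, not the real _chat_preprocess output; exact exception messages depend on the installed torch/numpy versions):

```python
import numpy as np
import torch

pad_id, extra = 0, 3

# Bug 1: the old padding logic did `val + [pad_id] * N`, which concatenates
# for Python lists but not for tensors (#2610 reports a TypeError here).
input_ids = torch.LongTensor([5, 6, 7, 8])
try:
    input_ids + [pad_id] * extra
except (TypeError, RuntimeError) as exc:
    print("bug 1:", exc)

# Bug 2: input_ids were padded but loss_mask was not, so bin-mates end up
# with ragged loss_mask lengths and np.array() refuses to stack them.
per_seq_data = [{"loss_mask": [1, 1, 1]}, {"loss_mask": [1, 1, 1, 1, 1]}]
try:
    np.array([x["loss_mask"] for x in per_seq_data])
except ValueError as exc:
    print("bug 2:", exc)
```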

Fix

src/megatron/bridge/data/datasets/packed_sequence.py:121-167

  • Normalize each value to a Python list via tolist() before concatenation, so tensor and list inputs share the same padding path.
  • Pad input_ids and context_ids with pad_id; pad loss_mask with False so padded positions never contribute to the training loss.
  • Iterate over a fixed _PAD_VALUES map keyed by {input_ids, context_ids, loss_mask} and skip keys missing from data; this preserves the non-chat path, which has no loss_mask (a condensed sketch of this loop follows the diff note below).

The diff is +44 / -13 in the source file; the public function signature is unchanged.
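For orientation, a condensed sketch of the padding loop described above (illustrative only; the real implementation in packed_sequence.py is more involved, and the pad-length bookkeeping here is simplified):

```python
import torch

# None means "pad with pad_id"; loss_mask pads with False so padded
# positions never contribute to the training loss.
_PAD_VALUES = {"input_ids": None, "context_ids": None, "loss_mask": False}


def pre_pad_item(data: dict, pad_id: int, pad_seq_to_mult: int) -> None:
    """Pad one example in place so len(input_ids) is a multiple of pad_seq_to_mult."""
    if pad_seq_to_mult <= 1:
        return  # bypass: inputs pass through unchanged

    pad_len = (-len(data["input_ids"])) % pad_seq_to_mult
    if pad_len == 0:
        return

    for key, fill in _PAD_VALUES.items():
        if key not in data:  # non-chat path has no loss_mask
            continue
        val = data[key]
        if torch.is_tensor(val):
            # Normalize tensors to Python lists so tensor and list inputs
            # share the same concatenation-based padding path.
            val = val.tolist()
        data[key] = val + [(pad_id if fill is None else fill)] * pad_len
```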

Tests

tests/unit_tests/data/datasets/test_packed_chat_padding.py (new, 233 LoC) covers the full bug surface:

Test | What it pins
test_torch_tensor_inputs_do_not_raise | Bug 1: tensor inputs no longer raise TypeError during padding
test_loss_mask_is_padded_to_input_ids_length | Bug 2: bin-mates with different original lengths now share equal padded length
test_loss_mask_padding_uses_false | Padded loss_mask positions are False (no spurious loss on pad)
test_context_ids_padded_with_pad_id | context_ids continues to pad with pad_id, like input_ids
test_non_chat_list_inputs_still_padded | Non-chat path (lists, no loss_mask) still works; we don't invent loss_mask
test_pad_seq_to_mult_one_skips_padding | pad_seq_to_mult=1 bypass is preserved (tensors pass through unchanged)
test_create_hist_then_fill_packing_strategy_no_error | End-to-end round-trip through create_hist → fill_packing_strategy that previously raised ValueError: inhomogeneous shape

Implementation notes for reviewers:

  • All tests use unittest.mock to stub create_sft_dataset, so they run on CPU and don't need a real tokenizer or HF Hub access.
  • Sequence lengths are tiny (3-8 tokens) and max_seq_length=64, keeping the suite well under the L0 budget.
  • The end-to-end test deliberately constructs two items of original length 3 and 5 — both round to max_length_to_pad=8 and land in the same create_hist bucket. Before the fix, this is the exact configuration that triggered the inhomogeneous-shape error reported in #2610 (tokenize_dataset crashes with TypeError when chat=True and pad_seq_to_mult > 1). An illustrative skeleton of the stubbing pattern follows this list.
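For reviewers who want the shape of the stubbing pattern without opening the test file, an illustrative skeleton follows (not the actual test; the mock.patch target, fixture contents, and the entry point being exercised are assumptions):

```python
from unittest import mock

import torch


def _fake_chat_items():
    # Two chat-style items of original length 3 and 5; per the note above,
    # both round to max_length_to_pad=8 and share a create_hist bucket.
    return [
        {
            "input_ids": torch.LongTensor([1, 2, 3]),
            "context_ids": torch.LongTensor([1]),
            "loss_mask": torch.BoolTensor([False, True, True]),
        },
        {
            "input_ids": torch.LongTensor([1, 2, 3, 4, 5]),
            "context_ids": torch.LongTensor([1, 2]),
            "loss_mask": torch.BoolTensor([False, False, True, True, True]),
        },
    ]


def test_torch_tensor_inputs_do_not_raise():
    # Patch target is an assumption about where create_sft_dataset is imported.
    with mock.patch(
        "megatron.bridge.data.datasets.packed_sequence.create_sft_dataset",
        return_value=_fake_chat_items(),
    ):
        # Exercise the padding/packing entry point here (e.g. pre_pad_dataset
        # over each stubbed item) and assert that every loss_mask now has the
        # same length as its padded input_ids.
        ...
```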

Test plan

  • ruff check and ruff format --check pass on both files
  • Standalone simulation of the padding logic verified all invariants pre-CI (cannot run pytest locally on macOS without the Megatron-LM submodule + GPU deps; CI will validate)
  • CI L0 unit tests — runs tests/unit_tests/data/datasets/test_packed_chat_padding.py automatically (test discovery is path-based)

Risk

Zero production risk for users who don't combine chat=True with pad_seq_to_mult > 1 — those code paths are untouched. For the affected combination, behavior changes from "raises TypeError / ValueError" to "produces a correctly padded packed dataset" — no silent behavior change.

Notes for reviewers

  • I considered limiting the change to converting tensors to lists in the existing {input_ids, context_ids} set, but that would leave Bug 2 unfixed: bin-mates with mismatched loss_mask lengths still crash inside fill_packing_strategy. Both fixes are required for the reported repro.
  • loss_mask pad value: False is consistent with how _collate_item treats the padding region (pad_id=0) inside GPTSFTChatDataset.collate_fn (src/megatron/bridge/data/datasets/sft.py:1261), and with the non-chat _build_loss_mask path which uses 0.0/1.0 floats.

…-NeMo#2610)

`tokenize_dataset` raised when packing chat datasets with `pad_seq_to_mult > 1`:

1. `_chat_preprocess` (and the legacy `_preprocess`) return `torch.LongTensor` /
   `torch.BoolTensor`, but `pre_pad_dataset` extended values via `val + [pad_id] * N`,
   which raises `TypeError` for tensors.
2. Only `input_ids` and `context_ids` were padded. `loss_mask` was left at its
   original length, so when `create_hist` grouped two items by their padded
   `input_ids` length, `fill_packing_strategy` later hit
   `ValueError: setting an array element with a sequence ... inhomogeneous shape`
   on `np.array([x["loss_mask"] for x in per_seq_data])`.

Fix:
- Normalize each value to a Python list via `tolist()` before concatenation, so
  tensor and list inputs share the same padding path.
- Pad `input_ids` and `context_ids` with `pad_id`; pad `loss_mask` with `False`
  so padded positions never contribute to the training loss.
- Skip keys missing from `data` (preserves the non-chat path which has no
  `loss_mask`).

Tests:
- `test_packed_chat_padding.py` exercises the full bug surface: tensor inputs,
  bin-mate alignment of `loss_mask`, fill value semantics, non-chat list path,
  the `pad_seq_to_mult=1` bypass, and an end-to-end `create_hist` →
  `fill_packing_strategy` round-trip that previously raised.

Signed-off-by: lonexreb <[email protected]>
copy-pr-bot (Bot) commented May 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

cuichenx (Contributor) commented May 6, 2026

/ok to test 8572aa6

cuichenx added the needs-review label (PR is ready for code review and waiting on a reviewer) on May 6, 2026

Labels

community-request, needs-review (PR is ready for code review and waiting on a reviewer)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

tokenize_dataset crashes with TypeError when chat=True and pad_seq_to_mult > 1

3 participants