### Describe the bug

`tokenize_dataset()` in `packed_sequence.py` fails when `chat=True` and `pad_seq_to_mult > 1`:

- **Tensor/list mismatch:** `_chat_preprocess` returns `torch.LongTensor`/`torch.BoolTensor`, but `pre_pad_dataset` concatenates with plain lists (`val + [pad_id] * ...`), raising `TypeError`.
- **Missing `loss_mask` padding:** `pre_pad_dataset` pads `input_ids` and `context_ids` but not `loss_mask`. Sequences with different original lengths can round to the same padded `input_ids` length, so `create_hist` groups them together, but their `loss_mask` arrays differ in length, causing `np.array()` in `fill_packing_strategy` to fail with `ValueError: inhomogeneous shape`.
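The length collision behind the second bullet can be shown with the round-up arithmetic alone (a minimal sketch; `padded_len` is a hypothetical stand-in for the rounding `pre_pad_dataset` performs, not the library's function):

```python
def padded_len(seq_len: int, mult: int) -> int:
    # Round seq_len up to the next multiple of mult (ceiling division)
    return -(-seq_len // mult) * mult

# With pad_seq_to_mult=8, raw lengths 5 and 7 both pad to 8, so create_hist
# would bin the two samples together even though their unpadded loss_mask
# arrays still have lengths 5 and 7 -- an inhomogeneous group.
print(padded_len(5, 8), padded_len(7, 8))
```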
### Steps/Code to reproduce bug

```python
from megatron.bridge.data.builders.finetuning_dataset import FinetuningDatasetBuilder
from megatron.bridge.data.datasets.packed_sequence import PackedSequenceSpecs

builder = FinetuningDatasetBuilder(
    dataset_root="path/to/jsonl",
    tokenizer=tokenizer,
    seq_length=221184,
    packed_sequence_specs=PackedSequenceSpecs(
        packed_sequence_size=221184,
        tokenizer_model_name="qwen3-14b",
        pad_seq_to_mult=8,
    ),
    dataset_kwargs={"chat": True, "use_hf_tokenizer_chat_template": True},
)
builder.prepare_packed_data()
```
### Expected behavior

Tokenization and packing complete without error when using chat datasets with `pad_seq_to_mult > 1`.
### Additional context

- `GPTSFTDataset` (non-chat) is unaffected: it returns plain lists and does not include `loss_mask` in its output dict.
- `pad_seq_to_mult=1` is unaffected: the padding block is skipped entirely.
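One possible direction for a fix (a sketch only; the helper name `pad_entry` and the guarded `torch` import are mine, not the library's): make the padding concatenation type-aware so it accepts both plain lists and the tensors `_chat_preprocess` returns, and apply it to `loss_mask` as well:

```python
def pad_entry(val, pad_len, pad_id):
    """Append pad_len copies of pad_id to a plain list or a 1-D torch tensor."""
    try:
        import torch
        if isinstance(val, torch.Tensor):
            pad = torch.full((pad_len,), pad_id, dtype=val.dtype)
            return torch.cat([val, pad])
    except ImportError:
        pass  # fall back to the list path when torch is unavailable
    return list(val) + [pad_id] * pad_len
```

Padding `loss_mask` with zeros (so pad tokens are excluded from the loss) would keep every per-sample array the same length, letting `np.array()` in `fill_packing_strategy` build a rectangular batch.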