[TRTLLM-11547][feat] Add Qwen3.5 MTP support. by nv-guomingz · Pull Request #12646 · NVIDIA/TensorRT-LLM

nv-guomingz · 2026-04-01T03:01:40Z

Summary by CodeRabbit

Release Notes

New Features
- Added Multi-Token Prediction (MTP) speculative execution support for Qwen3Next models, enabling faster inference through draft prediction layers.
- Improved position embedding handling for different rope (Rotary Position Embedding) configurations in speculative generation.
Refactor
- Restructured gated delta-rule computation into a dedicated module for better code organization.
- Optimized key-value cache management for speculative layers.
Tests
- Added test coverage for position ID handling in speculative execution scenarios.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

coderabbitai · 2026-04-01T03:23:07Z

📝 Walkthrough

This pull request introduces Multi-Token Prediction (MTP) support for the Qwen3Next model. It implements checkpoint remapping for MTP layers, refactors the Gated Delta Net implementation into a dedicated module, and improves speculative execution through better position ID management for multi-axis RoPE variants.

Changes

| Cohort / File(s) | Summary |
|---|---|
| MTP Architecture & Weight Mapping: `tensorrt_llm/_torch/models/checkpoints/hf/qwen3_next_weight_mapper.py`, `tensorrt_llm/_torch/models/modeling_qwen3_next.py`, `tensorrt_llm/_torch/models/modeling_speculative.py` | Introduces MTP layer remapping in checkpoints; adds `Qwen3NextMTPHead` and `Qwen3NextMTP` classes; replaces the single `aux_stream` with `aux_stream_dict` for multi-stream support; extends `Qwen3NextForCausalLM` to handle MTP layers in speculative decoding mode; adds `model_type` matching for `qwen3_next` in speculative layer selection. |
| Gated Delta Net Module: `tensorrt_llm/_torch/modules/mamba/gdn_mixer.py` | New module implementing `Qwen3NextGatedDeltaNet` with Triton-based kernels for fused QKVZ/BA splitting and gated delta-rule gating; includes forward paths for decode, extend (prefix+decode), and speculative verification, with configurable state management and optional speculative intermediate updates. |
| Speculative Execution & Position ID Handling: `tensorrt_llm/_torch/speculative/mtp.py`, `tensorrt_llm/_torch/pyexecutor/_util.py`, `tensorrt_llm/_torch/pyexecutor/model_engine.py` | Adds helper functions to normalize and select position IDs across standard RoPE and multi-axis MRoPE layouts; conditionally augments KV cache masks for MTP speculative layers; makes position ID offset updates conditional on the presence of previous batch tokens. |
| Testing: `tests/unittest/_torch/speculative/test_mtp.py` | New `TestMTPPositionIds` test class validating that `_select_mtp_position_ids` preserves MRoPE axes and that `prepare_drafter_inputs` is correct for 3D position ID tensors in generation steps. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 42.86% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ⚠️ Warning | The PR description contains only the repository template with no specific implementation details, test coverage information, or checklist completion. | Fill in the PR description with: a clear summary of what Qwen3.5 MTP support adds, why these changes were made, relevant test cases that cover the new functionality, and completion of the PR checklist items. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly identifies the main feature being added (Qwen3.5 MTP support) and is concise and specific to the changeset's primary objective. |

coderabbitai

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/models/modeling_qwen3_next.py`:
- Around line 655-660: The code calls allgather(...) inside the
enable_lm_head_tp_in_adp branch (see enable_lm_head_tp_in_adp,
mapping_lm_head_tp, create_lm_head_tp_mapping) but allgather is not imported,
causing a NameError; fix by adding allgather to the import list from the
distributed module (the same place other distributed helpers are imported) so
that allgather is available when mapping_lm_head_tp and hidden_states are
all-gathered.

In `@tensorrt_llm/_torch/models/modeling_speculative.py`:
- Around line 798-800: The draft-mode code paths need the same Qwen3/Qwen3.5
handling as the single-engine path: update MTPDraftModel.__init__ to accept
"qwen3_5_moe_text" (in addition to "qwen3_next") and map it to the same MTP
class (Qwen3NextMTP imported from .modeling_qwen3_next), and extend
MTPDraftModelForCausalLM.load_weights to include the Qwen3/Qwen3.5 branch so the
matcher that handles qwen3_next also handles qwen3_5_moe_text; ensure any
spec_dec_mode.is_mtp_eagle() / two-engine checks treat both names symmetrically.

In `@tensorrt_llm/_torch/modules/mamba/gdn_mixer.py`:
- Around line 107-162: There is a duplicate Triton kernel definition named
fused_qkvzba_split_reshape_cat_kernel that overwrites the earlier one and causes
Ruff F811; remove the second definition (the entire function starting at the
later occurrence) so only the original fused_qkvzba_split_reshape_cat_kernel
remains; locate the duplicate by searching for the function name
fused_qkvzba_split_reshape_cat_kernel and delete the later block (including its
signature and body) to resolve the redefinition error and restore CI.
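
The redefinition problem described above can be reproduced without Triton. The following is a minimal sketch in plain Python; the name `split_kernel` is hypothetical and stands in for the duplicated kernel. It shows why Ruff reports F811: the later `def` silently rebinds the name, so the first body can no longer be reached.

```python
def split_kernel():
    """First definition, analogous to the original kernel."""
    return "first definition"


def split_kernel():  # noqa: F811 (kept only to demonstrate the shadowing)
    """Second definition; it rebinds the same module-level name."""
    return "second definition"


# Only the later definition is reachable through the name.
result = split_kernel()
print(result)
```

Deleting the later block, as the comment suggests, restores the original behaviour and clears the lint error.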

In `@tensorrt_llm/_torch/pyexecutor/_util.py`:
- Around line 1038-1049: The spec-layer extension currently appends spec entries
to hybrid_layer_mask/mamba_layer_mask without considering the caller-provided
layer_mask; update the logic so that if a layer_mask argument is provided you
first apply it to the existing hybrid_layer_mask and mamba_layer_mask (e.g.,
element-wise AND/zip) before computing num_layers and before extending with spec
layers, and ensure you align lengths (or pad/truncate) before extending with
get_num_spec_layers(spec_config) so the per-manager masks remain correct for
one-model separate draft KV mode.
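
One way to read the suggestion above is sketched below in plain Python. This is not the code in `_util.py`: the function name `build_layer_masks` is hypothetical, and two assumptions are made loudly here. First, that the speculative layers are attention layers, so they are appended as `True` to the hybrid mask and `False` to the Mamba mask. Second, that a length mismatch should raise rather than pad or truncate.

```python
def build_layer_masks(hybrid_layer_mask, mamba_layer_mask,
                      layer_mask=None, num_spec_layers=0):
    """Apply an optional caller mask, then append entries for spec layers."""
    if layer_mask is not None:
        if len(layer_mask) != len(hybrid_layer_mask):
            # Assumption: fail loudly instead of padding or truncating.
            raise ValueError("layer_mask length must match the base layers")
        # Element-wise AND keeps a layer only if both masks select it.
        hybrid = [h and m for h, m in zip(hybrid_layer_mask, layer_mask)]
        mamba = [s and m for s, m in zip(mamba_layer_mask, layer_mask)]
    else:
        hybrid = list(hybrid_layer_mask)
        mamba = list(mamba_layer_mask)

    # Assumption: spec layers use attention, not Mamba state.
    hybrid += [True] * num_spec_layers
    mamba += [False] * num_spec_layers
    return hybrid, mamba


hybrid, mamba = build_layer_masks(
    [True, False, True],
    [False, True, False],
    layer_mask=[True, True, False],
    num_spec_layers=1,
)
print(hybrid, mamba)
```

Applying the caller mask before the extension keeps the base-layer entries and the appended spec-layer entries from being confused with each other.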

In `@tensorrt_llm/_torch/pyexecutor/model_engine.py`:
- Around line 1525-1531: The slice assignment to inputs['position_ids'] using
previous_batch_tokens and previous_pos_id_offsets_cuda has closing bracket
indentation that triggers Flake8 E123; locate the block around
previous_batch_tokens > 0 in method/model where inputs['position_ids'] is
modified and reformat the expression so the bracketed index and the added value
align on the same indentation level (e.g., put the full slice [0,
num_ctx_tokens:num_ctx_tokens + previous_batch_tokens] on one line or keep the
opening bracket on the same column as the closing bracket) and ensure the
continuation of + (self.previous_pos_id_offsets_cuda[:previous_batch_tokens]) is
indented consistently; update the lines referencing previous_batch_tokens,
inputs['position_ids'], and previous_pos_id_offsets_cuda to satisfy E123.
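
A layout that avoids the awkward multi-line index altogether is to name the slice bounds first. The sketch below uses a flat Python list in place of the tensor, and the function name `apply_previous_offsets` is hypothetical; it only illustrates the guarded update and the formatting, not the actual `model_engine.py` code.

```python
def apply_previous_offsets(position_ids, offsets, num_ctx_tokens,
                           previous_batch_tokens):
    """Add per-token offsets to the slots that follow the context tokens."""
    if previous_batch_tokens > 0:
        # Naming the bounds keeps the index expression on one line.
        start = num_ctx_tokens
        end = num_ctx_tokens + previous_batch_tokens
        current = position_ids[start:end]
        deltas = offsets[:previous_batch_tokens]
        position_ids[start:end] = [p + d for p, d in zip(current, deltas)]
    return position_ids


updated = apply_previous_offsets([0, 1, 2, 3, 4], [10, 20], 2, 2)
untouched = apply_previous_offsets([0, 1, 2], [10, 20], 1, 0)
print(updated, untouched)
```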

In `@tensorrt_llm/_torch/speculative/mtp.py`:
- Around line 1038-1058: The reshape fails for padded 3D MRoPE batches because
the new branch in _select_mtp_position_ids/position_ids_gen assumes the last
axis is already truncated, but SpecDecOneEngineForCausalLM.forward still slices
position_ids as position_ids[:, :attn_metadata.num_tokens] (slicing the second
axis), leaving the time axis padded; fix by ensuring position_ids are truncated
on their last axis before this code runs — either update
SpecDecOneEngineForCausalLM.forward to slice position_ids on the final dimension
(e.g., position_ids[..., :attn_metadata.num_tokens]) or, inside the block that
computes position_ids_gen in _select_mtp_position_ids, explicitly slice/truncate
the last dimension to attn_metadata.num_tokens (or detect padded_num_tokens and
trim) before reshaping position_ids_gen and computing position_ids_delta so the
subsequent reshapes on position_ids_gen and position_ids_gen.flatten(...) cannot
encounter padded tokens.
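
The axis mismatch described above can be illustrated with nested lists instead of tensors. This sketch assumes a standard RoPE layout of shape `[1, num_tokens]` and an MRoPE layout of shape `[3, 1, num_tokens]`; the helper names are hypothetical. Slicing the second axis, as `position_ids[:, :n]` does, trims the 2D layout but leaves the 3D layout padded, whereas truncating the innermost axis, as `position_ids[..., :n]` does, handles both.

```python
def slice_second_axis(position_ids, num_tokens):
    """Mimic position_ids[:, :num_tokens], which slices axis 1 only."""
    return [row[:num_tokens] for row in position_ids]


def truncate_last_axis(position_ids, num_tokens):
    """Mimic position_ids[..., :num_tokens] on a nested list."""
    if position_ids and isinstance(position_ids[0], list):
        return [truncate_last_axis(row, num_tokens) for row in position_ids]
    return position_ids[:num_tokens]


num_tokens, padded_tokens = 3, 5
rope = [list(range(padded_tokens))]                       # shape [1, 5]
mrope = [[list(range(padded_tokens))] for _ in range(3)]  # shape [3, 1, 5]

rope_ok = slice_second_axis(rope, num_tokens)      # token axis trimmed
mrope_bad = slice_second_axis(mrope, num_tokens)   # token axis still padded
mrope_ok = truncate_last_axis(mrope, num_tokens)   # token axis trimmed

print(len(rope_ok[0]), len(mrope_bad[0][0]), len(mrope_ok[0][0]))
```

In the padded 3D case the token axis keeps its padded length after the second-axis slice, which is the condition under which a later reshape to the unpadded token count would fail.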

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 71fda09c-e57a-48bd-8259-effc37d874ac

📥 Commits

Reviewing files that changed from the base of the PR and between 3676be4 and 3e753b6.

📒 Files selected for processing (8)

tensorrt_llm/_torch/models/checkpoints/hf/qwen3_next_weight_mapper.py
tensorrt_llm/_torch/models/modeling_qwen3_next.py
tensorrt_llm/_torch/models/modeling_speculative.py
tensorrt_llm/_torch/modules/mamba/gdn_mixer.py
tensorrt_llm/_torch/pyexecutor/_util.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/_torch/speculative/mtp.py
tests/unittest/_torch/speculative/test_mtp.py

nv-guomingz · 2026-04-02T05:58:49Z

/bot run

tensorrt-cicd · 2026-04-02T06:05:05Z

PR_Github #41353 [ run ] triggered by Bot. Commit: 5b02d2a Link to invocation

tensorrt-cicd · 2026-04-02T08:22:49Z

PR_Github #41353 [ run ] completed with state SUCCESS. Commit: 5b02d2a
/LLM/main/L0_MergeRequest_PR pipeline #32298 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

nv-guomingz · 2026-04-03T11:46:01Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-03T11:52:19Z

PR_Github #41659 [ run ] triggered by Bot. Commit: e5ffbad Link to invocation

tensorrt-cicd · 2026-04-03T19:57:10Z

PR_Github #41659 [ run ] completed with state SUCCESS. Commit: e5ffbad
/LLM/main/L0_MergeRequest_PR pipeline #32564 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

sunnyqgg

LGTM

nv-guomingz · 2026-05-17T04:31:31Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-17T04:36:54Z

PR_Github #48731 [ run ] triggered by Bot. Commit: d8b8a0d Link to invocation

tensorrt-cicd · 2026-05-17T10:59:25Z

PR_Github #48731 [ run ] completed with state SUCCESS. Commit: d8b8a0d
/LLM/main/L0_MergeRequest_PR pipeline #38498 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

nv-guomingz · 2026-05-17T14:24:42Z

/bot run --disable-fail-fast

nv-guomingz · 2026-05-18T03:32:24Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-18T03:37:43Z

PR_Github #48827 [ run ] triggered by Bot. Commit: 7d1ce7f Link to invocation

tensorrt-cicd · 2026-05-18T12:25:07Z

PR_Github #48827 [ run ] completed with state ABORTED. Commit: 7d1ce7f

Link to invocation

nv-guomingz · 2026-05-19T06:55:14Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-19T07:02:46Z

PR_Github #49128 [ run ] triggered by Bot. Commit: 0ff516c Link to invocation

tensorrt-cicd · 2026-05-19T21:31:33Z

PR_Github #49128 [ run ] completed with state SUCCESS. Commit: 0ff516c
/LLM/main/L0_MergeRequest_PR pipeline #38826 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

nv-guomingz · 2026-05-20T02:53:33Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-20T02:59:51Z

PR_Github #49319 [ run ] triggered by Bot. Commit: 1e3e3d3 Link to invocation

Signed-off-by: nv-guomingz <[email protected]>

nv-guomingz · 2026-05-20T06:35:18Z

/bot run

tensorrt-cicd · 2026-05-20T06:41:30Z

PR_Github #49374 [ run ] triggered by Bot. Commit: 1f83515 Link to invocation

tensorrt-cicd · 2026-05-20T06:45:26Z

PR_Github #49319 [ run ] completed with state ABORTED. Commit: 1e3e3d3

Link to invocation

tensorrt-cicd · 2026-05-20T12:25:20Z

PR_Github #49374 [ run ] completed with state SUCCESS. Commit: 1f83515
/LLM/main/L0_MergeRequest_PR pipeline #39027 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

nv-guomingz · 2026-05-20T16:39:07Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-20T16:47:46Z

PR_Github #49453 [ run ] triggered by Bot. Commit: 1f83515 Link to invocation

tensorrt-cicd · 2026-05-20T23:11:11Z

PR_Github #49453 [ run ] completed with state SUCCESS. Commit: 1f83515
/LLM/main/L0_MergeRequest_PR pipeline #39097 completed with status: 'SUCCESS'

CI Report

Link to invocation

Signed-off-by: nv-guomingz <[email protected]>

nv-guomingz requested review from a team as code owners April 1, 2026 03:01

nv-guomingz requested review from lancelly, omera-nv, sunnyqgg and yechank-nvidia April 1, 2026 03:01

github-actions Bot assigned nv-guomingz Apr 1, 2026

coderabbitai Bot reviewed Apr 1, 2026

nv-guomingz force-pushed the user/guomingz/qwen3.5-mtp-rebase-izzy branch 3 times, most recently from 340868c to 5b02d2a Compare April 2, 2026 05:53

nv-guomingz force-pushed the user/guomingz/qwen3.5-mtp-rebase-izzy branch from 5b02d2a to 0d450fd Compare April 3, 2026 09:49

nv-guomingz requested a review from a team as a code owner April 3, 2026 09:49

nv-guomingz force-pushed the user/guomingz/qwen3.5-mtp-rebase-izzy branch from 0d450fd to 83415bd Compare April 3, 2026 11:42

nv-guomingz requested a review from a team as a code owner April 3, 2026 11:42

nv-guomingz force-pushed the user/guomingz/qwen3.5-mtp-rebase-izzy branch from 83415bd to e5ffbad Compare April 3, 2026 11:44

nv-guomingz force-pushed the user/guomingz/qwen3.5-mtp-rebase-izzy branch from e5ffbad to 4b09e78 Compare April 3, 2026 15:37

sunnyqgg approved these changes Apr 7, 2026

Comment thread tensorrt_llm/_torch/models/checkpoints/hf/qwen3_next_weight_mapper.py Outdated

nv-guomingz changed the title ~~[None][feat] Add Qwen3.5 MTP support.~~ [TRTLLM-11547][feat] Add Qwen3.5 MTP support. Apr 7, 2026

Wanli-Jiang reviewed Apr 8, 2026

Comment thread tensorrt_llm/_torch/models/modeling_speculative.py

Wanli-Jiang reviewed Apr 8, 2026

Comment thread tests/integration/defs/accuracy/test_llm_api_pytorch.py Outdated

nv-guomingz force-pushed the user/guomingz/qwen3.5-mtp-rebase-izzy branch 2 times, most recently from a856878 to 7d1ce7f Compare May 18, 2026 03:32

nv-guomingz force-pushed the user/guomingz/qwen3.5-mtp-rebase-izzy branch from 7d1ce7f to 0ff516c Compare May 19, 2026 06:55

nv-guomingz force-pushed the user/guomingz/qwen3.5-mtp-rebase-izzy branch from 0ff516c to 1e3e3d3 Compare May 20, 2026 02:53

[None][feat] Add the Qwen3.5 mtp support.

1f83515

Signed-off-by: nv-guomingz <[email protected]>

nv-guomingz force-pushed the user/guomingz/qwen3.5-mtp-rebase-izzy branch from 1e3e3d3 to 1f83515 Compare May 20, 2026 06:34

nv-guomingz merged commit 5d19712 into NVIDIA:main May 21, 2026
7 checks passed

nv-guomingz deleted the user/guomingz/qwen3.5-mtp-rebase-izzy branch May 21, 2026 03:14

xxi-nv pushed a commit to xxi-nv/TensorRT-LLM that referenced this pull request May 22, 2026

[TRTLLM-11547][feat] Add Qwen3.5 MTP support. (NVIDIA#12646)

4856f54

Signed-off-by: nv-guomingz <[email protected]>

moraxu mentioned this pull request May 26, 2026

[TRTLLM-12500][feat] Add support for Qwen3.5 VL MoE (with the MTP fixes) #14599

Open

1 task

bmarimuthu-nv pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request May 28, 2026

[TRTLLM-11547][feat] Add Qwen3.5 MTP support. (NVIDIA#12646)

4bda677

Signed-off-by: nv-guomingz <[email protected]>
