[TRTLLM-11547][feat] Add Qwen3.5 MTP support. #12646

Merged
nv-guomingz merged 1 commit into NVIDIA:main from nv-guomingz:user/guomingz/qwen3.5-mtp-rebase-izzy
May 21, 2026

Conversation

@nv-guomingz
Collaborator

@nv-guomingz nv-guomingz commented Apr 1, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added Multi-Token Prediction (MTP) speculative execution support for Qwen3Next models, enabling faster inference through draft prediction layers.
    • Improved position embedding handling for different rope (Rotary Position Embedding) configurations in speculative generation.
  • Refactor

    • Restructured gated delta-rule computation into a dedicated module for better code organization.
    • Optimized key-value cache management for speculative layers.
  • Tests

    • Added test coverage for position ID handling in speculative execution scenarios.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 1, 2026

📝 Walkthrough


This pull request introduces Multi-Token Prediction (MTP) support for the Qwen3Next model. It implements checkpoint remapping for MTP layers, refactors the Gated Delta Net implementation into a dedicated module, and improves speculative execution handling with better position ID management for multi-axis RoPE variants.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **MTP Architecture & Weight Mapping**: `tensorrt_llm/_torch/models/checkpoints/hf/qwen3_next_weight_mapper.py`, `tensorrt_llm/_torch/models/modeling_qwen3_next.py`, `tensorrt_llm/_torch/models/modeling_speculative.py` | Introduces MTP layer remapping in checkpoints; adds `Qwen3NextMTPHead` and `Qwen3NextMTP` classes; replaces the single `aux_stream` with `aux_stream_dict` for multi-stream support; extends `Qwen3NextForCausalLM` to handle MTP layers in speculative decoding mode; adds `model_type` matching for `qwen3_next` in speculative layer selection. |
| **Gated Delta Net Module**: `tensorrt_llm/_torch/modules/mamba/gdn_mixer.py` | New module implementing `Qwen3NextGatedDeltaNet` with Triton-based kernels for fused QKVZ/BA splitting and gated delta-rule gating; includes forward paths for decode, extend (prefix+decode), and speculative verification with configurable state management and optional speculative intermediate updates. |
| **Speculative Execution & Position ID Handling**: `tensorrt_llm/_torch/speculative/mtp.py`, `tensorrt_llm/_torch/pyexecutor/_util.py`, `tensorrt_llm/_torch/pyexecutor/model_engine.py` | Adds helper functions to normalize and select position IDs across standard RoPE and multi-axis MRoPE layouts; conditionally augments KV cache masks for MTP speculative layers; makes position ID offset updates conditional on the presence of previous batch tokens. |
| **Testing**: `tests/unittest/_torch/speculative/test_mtp.py` | New `TestMTPPositionIds` test class validating that `_select_mtp_position_ids` preserves MRoPE axes and that `prepare_drafter_inputs` is correct for 3D position ID tensors in generation steps. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 42.86% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ⚠️ Warning | The PR description contains only the repository template with no specific implementation details, test coverage information, or checklist completion. | Fill in the PR description with: a clear summary of what Qwen3.5 MTP support adds, why these changes were made, relevant test cases that cover the new functionality, and completion of the PR checklist items. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly identifies the main feature being added (Qwen3.5 MTP support) and is concise and specific to the changeset's primary objective. |


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/models/modeling_qwen3_next.py`:
- Around line 655-660: The code calls allgather(...) inside the
enable_lm_head_tp_in_adp branch (see enable_lm_head_tp_in_adp,
mapping_lm_head_tp, create_lm_head_tp_mapping) but allgather is not imported,
causing a NameError; fix by adding allgather to the import list from the
distributed module (the same place other distributed helpers are imported) so
that allgather is available when mapping_lm_head_tp and hidden_states are
all-gathered.

In `@tensorrt_llm/_torch/models/modeling_speculative.py`:
- Around line 798-800: The draft-mode code paths need the same Qwen3/Qwen3.5
handling as the single-engine path: update MTPDraftModel.__init__ to accept
"qwen3_5_moe_text" (in addition to "qwen3_next") and map it to the same MTP
class (Qwen3NextMTP imported from .modeling_qwen3_next), and extend
MTPDraftModelForCausalLM.load_weights to include the Qwen3/Qwen3.5 branch so the
matcher that handles qwen3_next also handles qwen3_5_moe_text; ensure any
spec_dec_mode.is_mtp_eagle() / two-engine checks treat both names symmetrically.
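
One way to keep the two `model_type` names symmetric is a single lookup table shared by every code path. The sketch below is only an illustration of that idea: `Qwen3NextMTP` is a stand-in class, and `MTP_LAYER_BY_MODEL_TYPE` and `select_mtp_layer_cls` are hypothetical names that do not appear in the PR.

```python
class Qwen3NextMTP:
    """Stand-in for the real MTP layer class (hypothetical placeholder)."""


# Hypothetical lookup table: both model_type strings resolve to the same class,
# so single-engine and draft-model paths cannot drift apart.
MTP_LAYER_BY_MODEL_TYPE = {
    "qwen3_next": Qwen3NextMTP,
    "qwen3_5_moe_text": Qwen3NextMTP,
}


def select_mtp_layer_cls(model_type: str) -> type:
    """Return the MTP layer class for model_type, or raise if unsupported."""
    try:
        return MTP_LAYER_BY_MODEL_TYPE[model_type]
    except KeyError:
        raise NotImplementedError(
            f"No MTP layer registered for model_type={model_type!r}") from None
```

With a shared table, adding a new alias is a one-line change and every caller picks it up, which is the symmetry the comment asks for.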

In `@tensorrt_llm/_torch/modules/mamba/gdn_mixer.py`:
- Around line 107-162: There is a duplicate Triton kernel definition named
fused_qkvzba_split_reshape_cat_kernel that overwrites the earlier one and causes
Ruff F811; remove the second definition (the entire function starting at the
later occurrence) so only the original fused_qkvzba_split_reshape_cat_kernel
remains; locate the duplicate by searching for the function name
fused_qkvzba_split_reshape_cat_kernel and delete the later block (including its
signature and body) to resolve the redefinition error and restore CI.
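
To illustrate why a duplicate definition matters, the sketch below shows that the later definition silently replaces the earlier one, and uses the standard-library `ast` module to list module-level functions defined more than once. It is a simplified check written for illustration, not Ruff's F811 implementation; `find_redefined_functions` and `SAMPLE` are hypothetical names.

```python
import ast
from collections import Counter


def find_redefined_functions(source: str) -> list[str]:
    """Return names of module-level functions that are defined more than once."""
    tree = ast.parse(source)
    counts = Counter(
        node.name for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)))
    return sorted(name for name, n in counts.items() if n > 1)


# Toy module with the same function name defined twice.
SAMPLE = '''
def kernel():
    return "first"

def kernel():
    return "second"

def helper():
    return "only once"
'''

# Executing the module shows that the second definition wins.
namespace = {}
exec(SAMPLE, namespace)
active_result = namespace["kernel"]()
duplicates = find_redefined_functions(SAMPLE)
```

Here `active_result` is `"second"` and `duplicates` is `["kernel"]`: the first body is unreachable once the module has been imported.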

In `@tensorrt_llm/_torch/pyexecutor/_util.py`:
- Around line 1038-1049: The spec-layer extension currently appends spec entries
to hybrid_layer_mask/mamba_layer_mask without considering the caller-provided
layer_mask; update the logic so that if a layer_mask argument is provided you
first apply it to the existing hybrid_layer_mask and mamba_layer_mask (e.g.,
element-wise AND/zip) before computing num_layers and before extending with spec
layers, and ensure you align lengths (or pad/truncate) before extending with
get_num_spec_layers(spec_config) so the per-manager masks remain correct for
one-model separate draft KV mode.
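
The ordering suggested above (apply the caller's mask first, then append the speculative layers) can be sketched with plain Python lists. This is a minimal illustration under stated assumptions: `build_manager_mask` is a hypothetical helper, the choice of `True` for attention and `False` for Mamba speculative entries is assumed, and a length mismatch raises instead of padding or truncating.

```python
from typing import Optional, Sequence


def build_manager_mask(base_mask: Sequence[bool],
                       layer_mask: Optional[Sequence[bool]] = None,
                       num_spec_layers: int = 0,
                       spec_value: bool = True) -> list[bool]:
    """Combine a per-manager mask with an optional caller mask, then add spec layers.

    The caller-provided layer_mask is AND-ed element-wise with base_mask
    before the speculative-layer entries are appended.
    """
    mask = [bool(b) for b in base_mask]
    if layer_mask is not None:
        if len(layer_mask) != len(mask):
            raise ValueError("layer_mask and base_mask must have the same length")
        mask = [b and bool(m) for b, m in zip(mask, layer_mask)]
    # Speculative layers are appended after the target-model layers.
    return mask + [spec_value] * num_spec_layers


# Toy hybrid model: attention on layers 0 and 2, Mamba on layers 1 and 3.
hybrid = build_manager_mask([True, False, True, False],
                            layer_mask=[True, True, False, False],
                            num_spec_layers=1, spec_value=True)
mamba = build_manager_mask([False, True, False, True],
                           layer_mask=[True, True, False, False],
                           num_spec_layers=1, spec_value=False)
```

In this toy case `hybrid` is `[True, False, False, False, True]` and `mamba` is `[False, True, False, False, False]`: layers excluded by the caller stay excluded, and the speculative layer is added only where it belongs.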

In `@tensorrt_llm/_torch/pyexecutor/model_engine.py`:
- Around line 1525-1531: The slice assignment to inputs['position_ids'] using
previous_batch_tokens and previous_pos_id_offsets_cuda has closing bracket
indentation that triggers Flake8 E123; locate the block around
previous_batch_tokens > 0 in method/model where inputs['position_ids'] is
modified and reformat the expression so the bracketed index and the added value
align on the same indentation level (e.g., put the full slice [0,
num_ctx_tokens:num_ctx_tokens + previous_batch_tokens] on one line or keep the
opening bracket on the same column as the closing bracket) and ensure the
continuation of + (self.previous_pos_id_offsets_cuda[:previous_batch_tokens]) is
indented consistently; update the lines referencing previous_batch_tokens,
inputs['position_ids'], and previous_pos_id_offsets_cuda to satisfy E123.
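
One way to sidestep the bracket-indentation warning is to name the slice bounds so the update fits on one logical line. The sketch below uses plain nested lists in place of tensors and a hypothetical helper name, `apply_previous_offsets`; it only illustrates the guarded update, not the actual `model_engine.py` code.

```python
def apply_previous_offsets(position_ids: list[list[int]],
                           num_ctx_tokens: int,
                           previous_offsets: list[int]) -> list[list[int]]:
    """Add per-token offsets to the slice that follows the context tokens.

    The update is skipped entirely when there are no previous batch tokens.
    """
    previous_batch_tokens = len(previous_offsets)
    if previous_batch_tokens > 0:
        # Named bounds keep the slice on one line, avoiding a dangling bracket.
        start = num_ctx_tokens
        end = num_ctx_tokens + previous_batch_tokens
        row = position_ids[0]
        row[start:end] = [p + o for p, o in zip(row[start:end], previous_offsets)]
    return position_ids


updated = apply_previous_offsets([[0, 1, 2, 10, 11]], 3, [1, 2])
untouched = apply_previous_offsets([[0, 1, 2]], 3, [])
```

Here `updated` is `[[0, 1, 2, 11, 13]]` and `untouched` is `[[0, 1, 2]]`.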

In `@tensorrt_llm/_torch/speculative/mtp.py`:
- Around line 1038-1058: The reshape fails for padded 3D MRoPE batches because
the new branch in _select_mtp_position_ids/position_ids_gen assumes the last
axis is already truncated, but SpecDecOneEngineForCausalLM.forward still slices
position_ids as position_ids[:, :attn_metadata.num_tokens] (slicing the second
axis), leaving the time axis padded; fix by ensuring position_ids are truncated
on their last axis before this code runs — either update
SpecDecOneEngineForCausalLM.forward to slice position_ids on the final dimension
(e.g., position_ids[..., :attn_metadata.num_tokens]) or, inside the block that
computes position_ids_gen in _select_mtp_position_ids, explicitly slice/truncate
the last dimension to attn_metadata.num_tokens (or detect padded_num_tokens and
trim) before reshaping position_ids_gen and computing position_ids_delta so the
subsequent reshapes on position_ids_gen and position_ids_gen.flatten(...) cannot
encounter padded tokens.
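
The axis mismatch can be reproduced without tensors. In the sketch below, nested lists stand in for position ID tensors; `shape` and `truncate_last_axis` are hypothetical helpers, and the shapes `(1, T)` for RoPE and `(3, 1, T)` for MRoPE are assumed for illustration.

```python
def shape(nested) -> tuple:
    """Return the shape of a rectangular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0] if nested else None
    return tuple(dims)


def truncate_last_axis(nested: list, num_tokens: int) -> list:
    """Trim the innermost lists to num_tokens (like position_ids[..., :n])."""
    if nested and isinstance(nested[0], list):
        return [truncate_last_axis(inner, num_tokens) for inner in nested]
    return nested[:num_tokens]


num_tokens = 4  # real tokens; the remaining entries are padding
rope = [[0, 1, 2, 3, 0, 0]]                         # assumed shape (1, 6)
mrope = [[[0, 1, 2, 3, 0, 0]] for _ in range(3)]    # assumed shape (3, 1, 6)

# Slicing the second axis (like position_ids[:, :n]) works for the 2D layout...
rope_second_axis = [row[:num_tokens] for row in rope]
# ...but leaves the token axis padded for the 3D layout.
mrope_second_axis = [axis[:num_tokens] for axis in mrope]
# Trimming the last axis handles both layouts.
mrope_last_axis = truncate_last_axis(mrope, num_tokens)
```

`shape(mrope_second_axis)` stays `(3, 1, 6)` while `shape(mrope_last_axis)` is `(3, 1, 4)`, which is why a later reshape that expects exactly `num_tokens` entries per axis only succeeds after last-axis truncation.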

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 71fda09c-e57a-48bd-8259-effc37d874ac

📥 Commits

Reviewing files that changed from the base of the PR and between 3676be4 and 3e753b6.

📒 Files selected for processing (8)
  • tensorrt_llm/_torch/models/checkpoints/hf/qwen3_next_weight_mapper.py
  • tensorrt_llm/_torch/models/modeling_qwen3_next.py
  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tensorrt_llm/_torch/modules/mamba/gdn_mixer.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/speculative/mtp.py
  • tests/unittest/_torch/speculative/test_mtp.py

Comment thread tensorrt_llm/_torch/models/modeling_qwen3_next.py
Comment thread tensorrt_llm/_torch/models/modeling_speculative.py Outdated
Comment thread tensorrt_llm/_torch/modules/mamba/gdn_mixer.py Outdated
Comment thread tensorrt_llm/_torch/pyexecutor/_util.py Outdated
Comment thread tensorrt_llm/_torch/pyexecutor/model_engine.py Outdated
Comment thread tensorrt_llm/_torch/speculative/mtp.py
@nv-guomingz nv-guomingz force-pushed the user/guomingz/qwen3.5-mtp-rebase-izzy branch 3 times, most recently from 340868c to 5b02d2a Compare April 2, 2026 05:53
@nv-guomingz
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #41353 [ run ] triggered by Bot. Commit: 5b02d2a Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41353 [ run ] completed with state SUCCESS. Commit: 5b02d2a
/LLM/main/L0_MergeRequest_PR pipeline #32298 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@nv-guomingz nv-guomingz force-pushed the user/guomingz/qwen3.5-mtp-rebase-izzy branch from 5b02d2a to 0d450fd Compare April 3, 2026 09:49
@nv-guomingz nv-guomingz requested a review from a team as a code owner April 3, 2026 09:49
@nv-guomingz nv-guomingz force-pushed the user/guomingz/qwen3.5-mtp-rebase-izzy branch from 0d450fd to 83415bd Compare April 3, 2026 11:42
@nv-guomingz nv-guomingz requested a review from a team as a code owner April 3, 2026 11:42
@nv-guomingz nv-guomingz force-pushed the user/guomingz/qwen3.5-mtp-rebase-izzy branch from 83415bd to e5ffbad Compare April 3, 2026 11:44
@nv-guomingz
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #41659 [ run ] triggered by Bot. Commit: e5ffbad Link to invocation

@nv-guomingz nv-guomingz force-pushed the user/guomingz/qwen3.5-mtp-rebase-izzy branch from e5ffbad to 4b09e78 Compare April 3, 2026 15:37
@tensorrt-cicd
Collaborator

PR_Github #41659 [ run ] completed with state SUCCESS. Commit: e5ffbad
/LLM/main/L0_MergeRequest_PR pipeline #32564 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Collaborator

@sunnyqgg sunnyqgg left a comment


LGTM

Comment thread tensorrt_llm/_torch/models/checkpoints/hf/qwen3_next_weight_mapper.py Outdated
@nv-guomingz nv-guomingz changed the title [None][feat] Add Qwen3.5 MTP support. [TRTLLM-11547][feat] Add Qwen3.5 MTP support. Apr 7, 2026
Comment thread tensorrt_llm/_torch/models/modeling_speculative.py
Comment thread tests/integration/defs/accuracy/test_llm_api_pytorch.py Outdated
@nv-guomingz
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #48731 [ run ] triggered by Bot. Commit: d8b8a0d Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48731 [ run ] completed with state SUCCESS. Commit: d8b8a0d
/LLM/main/L0_MergeRequest_PR pipeline #38498 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@nv-guomingz
Collaborator Author

/bot run --disable-fail-fast

@nv-guomingz nv-guomingz force-pushed the user/guomingz/qwen3.5-mtp-rebase-izzy branch 2 times, most recently from a856878 to 7d1ce7f Compare May 18, 2026 03:32
@nv-guomingz
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #48827 [ run ] triggered by Bot. Commit: 7d1ce7f Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48827 [ run ] completed with state ABORTED. Commit: 7d1ce7f

Link to invocation

@nv-guomingz nv-guomingz force-pushed the user/guomingz/qwen3.5-mtp-rebase-izzy branch from 7d1ce7f to 0ff516c Compare May 19, 2026 06:55
@nv-guomingz
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #49128 [ run ] triggered by Bot. Commit: 0ff516c Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #49128 [ run ] completed with state SUCCESS. Commit: 0ff516c
/LLM/main/L0_MergeRequest_PR pipeline #38826 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@nv-guomingz nv-guomingz force-pushed the user/guomingz/qwen3.5-mtp-rebase-izzy branch from 0ff516c to 1e3e3d3 Compare May 20, 2026 02:53
@nv-guomingz
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #49319 [ run ] triggered by Bot. Commit: 1e3e3d3 Link to invocation

@nv-guomingz nv-guomingz force-pushed the user/guomingz/qwen3.5-mtp-rebase-izzy branch from 1e3e3d3 to 1f83515 Compare May 20, 2026 06:34
@nv-guomingz
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #49374 [ run ] triggered by Bot. Commit: 1f83515 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #49319 [ run ] completed with state ABORTED. Commit: 1e3e3d3

Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #49374 [ run ] completed with state SUCCESS. Commit: 1f83515
/LLM/main/L0_MergeRequest_PR pipeline #39027 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@nv-guomingz
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #49453 [ run ] triggered by Bot. Commit: 1f83515 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #49453 [ run ] completed with state SUCCESS. Commit: 1f83515
/LLM/main/L0_MergeRequest_PR pipeline #39097 completed with status: 'SUCCESS'

CI Report

Link to invocation

@nv-guomingz nv-guomingz merged commit 5d19712 into NVIDIA:main May 21, 2026
7 checks passed
@nv-guomingz nv-guomingz deleted the user/guomingz/qwen3.5-mtp-rebase-izzy branch May 21, 2026 03:14
xxi-nv pushed a commit to xxi-nv/TensorRT-LLM that referenced this pull request May 22, 2026
bmarimuthu-nv pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request May 28, 2026


6 participants