[None][perf] Integrate the flashinfer gdn prefill kernel for qwen3.5 by nv-guomingz · Pull Request #13644 · NVIDIA/TensorRT-LLM

nv-guomingz · 2026-04-30T05:13:12Z

Summary by CodeRabbit

New Features
- Added FlashInfer integration for optimized GDN (Gated Delta Network) prefill operations, with optional environment variable configuration to control which implementation is used.
- Configurable fallback to existing implementation for systems without FlashInfer support.
Tests
- Added operator-level parity tests validating FlashInfer GDN adapter against reference implementation across various sequence configurations.
- Added environment variable routing tests to verify correct implementation selection based on configuration.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

nv-guomingz · 2026-05-07T06:08:14Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-07T06:16:46Z

PR_Github #47115 [ run ] triggered by Bot. Commit: f73c47a Link to invocation

coderabbitai · 2026-05-07T06:17:16Z

📝 Walkthrough

Walkthrough

This PR introduces a FlashInfer-backed adapter for the Gated Delta Net (GDN) chunk attention operator. A new chunk_gated_delta_rule wrapper converts TRT-LLM's tensor layouts and conventions to FlashInfer's interface, with optional Triton fallback via environment flag. The integration into gdn_mixer is backward-compatible; comprehensive parity tests validate numerical equivalence across call shapes on supported GPUs.
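
The two value-space and layout conversions named in the walkthrough (log-space gates to linear space, and the state layout swap between (K, V) and (V, K)) can be illustrated without PyTorch or a GPU. This is a minimal sketch on plain Python lists, not the adapter's code; `gate_log_to_linear` and `transpose_state` are hypothetical helper names, and the real wrapper operates on tensors.

```python
import math


def gate_log_to_linear(g_log):
    """Convert log-space gate values to linear space by applying exp."""
    return [math.exp(x) for x in g_log]


def transpose_state(state):
    """Swap the two axes of one head's state matrix: (K, V) <-> (V, K)."""
    return [list(col) for col in zip(*state)]


# One head with K=2 rows and V=3 columns, stored in the (K, V) layout.
state_kv = [[1.0, 2.0, 3.0],
            [4.0, 5.0, 6.0]]

state_vk = transpose_state(state_kv)    # (V, K): 3 rows, 2 columns
round_trip = transpose_state(state_vk)  # back to (K, V)

# Illustrative log-space gate values; exp(0.0) is exactly 1.0.
gates = gate_log_to_linear([0.0, -0.5])
```

Transposing twice returns the original layout, which is why the wrapper can hand a (V, K) state to the kernel and return a (K, V) state to its caller without changing the values.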

Changes

FlashInfer GDN Adapter and Integration

Layer / File(s)	Summary
Wrapper Function Signature and Export `tensorrt_llm/_torch/modules/fla/flashinfer_chunk.py`	Exported function `chunk_gated_delta_rule(...)` decorated with `@torch.compiler.disable`, providing a unified call surface for Q/K/V, gates, initial state, varlen support, L2 normalization, and indexed in-place updates.
Argument Validation and Preconditions `tensorrt_llm/_torch/modules/fla/flashinfer_chunk.py`	Runtime assertions enforce tensor ranks, dtypes, fp32 requirement for g, unsupported head_first mode, required varlen cu_seqlens and initial_state, and conditional initial_state_indices when indexed updates are requested.
Layout and Value-Space Conversions `tensorrt_llm/_torch/modules/fla/flashinfer_chunk.py`	Squeeze Q/K/V from [1, T, H, D] to [T, H, D], convert g from log-space to linear-space via exp, optionally apply L2 normalization, and transpose initial SSM state from (K, V) to (V, K) layout with fp32 casting.
FlashInfer Integration and Buffer Management `tensorrt_llm/_torch/modules/fla/flashinfer_chunk.py`	Pre-allocate output and state buffers, compute head counts from packed tensors, invoke flashinfer.chunk_gated_delta_rule with converted inputs, and force use_qk_l2norm_in_kernel=False since normalization is pre-applied.
Output Layout Conversion and Return `tensorrt_llm/_torch/modules/fla/flashinfer_chunk.py`	Transpose returned state from (V, K) back to (K, V), optionally scatter into initial_state for indexed updates, conditionally return final state, and restore output activations to [1, T, H, D] layout.
Conditional Import and Mixer Integration `tensorrt_llm/_torch/modules/mamba/gdn_mixer.py`	gdn_mixer conditionally imports chunk_gated_delta_rule from flashinfer_chunk (default, enabled via TLLM_USE_FLASHINFER_GDN_PREFILL) or fla.chunk (fallback when disabled); both prefill and speculative-verify paths use the routed function.
Test Helpers and Architecture Gating `tests/unittest/_torch/modules/mamba/test_flashinfer_chunk_gdn.py`	Introduce _supported_arch() for SM90/SM100 capability check, skip_unsupported pytest marker, _make_inputs() for deterministic packed batch construction, and _zero_initial_state() for SSM tensor initialization.
Parity Tests and Routing Verification `tests/unittest/_torch/modules/mamba/test_flashinfer_chunk_gdn.py`	Six parity test cases (import smoke, single-sequence without/with L2 norm, variable-length batches, final-state output, indexed scatter) and one environment flag routing test; all compare FlashInfer results against Triton reference implementation.
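
The environment-flag routing summarized in the table could look roughly like the sketch below. Only the variable name `TLLM_USE_FLASHINFER_GDN_PREFILL` and the "FlashInfer by default, Triton when disabled" behavior come from this thread; the function name `select_gdn_prefill_backend` and the rule that only the string "0" disables FlashInfer are assumptions made for illustration.

```python
import os

# Variable name taken from the walkthrough; the accepted values are an assumption.
ENV_FLAG = "TLLM_USE_FLASHINFER_GDN_PREFILL"


def select_gdn_prefill_backend(environ=None):
    """Return "flashinfer" unless the flag is explicitly set to "0"."""
    env = os.environ if environ is None else environ
    value = env.get(ENV_FLAG, "1").strip()
    return "triton" if value == "0" else "flashinfer"


# Passing a mapping instead of reading os.environ keeps the check testable.
default_backend = select_gdn_prefill_backend({})
disabled_backend = select_gdn_prefill_backend({ENV_FLAG: "0"})
```

Accepting the environment as a parameter mirrors what a routing test needs: the selection can be checked for each flag value without mutating the process environment.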

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description is incomplete. Only the template structure is present with no actual content in the Description, Test Coverage, or relevant checklist items filled in.	Provide a clear description of what the PR does, why it's needed, what tests cover the changes, and ensure checklist items are properly reviewed.
Docstring Coverage	⚠️ Warning	Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly describes the main change: integrating the FlashInfer GDN prefill kernel, which is the primary objective evidenced by the three file changes.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

🧹 Nitpick comments (5)

tests/unittest/_torch/modules/mamba/test_flashinfer_chunk_gdn.py (3)
10-12: 💤 Low value

Drop from __future__ and use the Python 3.10+ list built-in.

from typing import List should be replaced by the built-in list type, and from __future__ import annotations is unnecessary.
♻️ Suggested change
-from __future__ import annotations
-
-from typing import List
-
 import pytest
 import torch
 @torch.no_grad()
 def _make_inputs(
-    seq_lens: List[int],
+    seq_lens: list[int],
Based on learnings: Python 3.10+ is required throughout the codebase and from __future__ import annotations is not needed. As per coding guidelines: "Prefer using built-in types list, dict, tuple instead of legacy typing.List."

Also applies to: 39-39
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/modules/mamba/test_flashinfer_chunk_gdn.py` around
lines 10 - 12, Remove the unnecessary future import line "from __future__ import
annotations" and replace any use of the typing alias "List" with the built-in
"list" type; specifically delete the import "from typing import List" and update
all type annotations in this test (including the other occurrence around the
original line 39) from "List[...]" to "list[...]" so the file uses Python 3.10+
built-ins and no future import.
1-30: Missing perf test coverage for a [perf]-tagged kernel change.

This PR swaps the prefill attention kernel path (Triton → FlashInfer) for Qwen3.5 GDN, which is explicitly performance-sensitive. The added tests are all unit/parity tests and do not assert any throughput or latency improvement. Per QA guidelines for PRs touching attention kernels, please verify:

Is there an entry in tests/integration/test_lists/test-db/l0_perf.yml (or the appropriate per-GPU l0_*.yml) that will catch a FlashInfer GDN prefill regression pre-merge?

If no such entry exists, consider adding a perf test in tests/integration/defs/perf/test_perf_sanity.py to establish a latency baseline for Qwen3.5 prefill under TLLM_USE_FLASHINFER_GDN_PREFILL=1 vs =0.

QA list updates to llm_function_core.txt are not required for these unit tests alone, but the absence of any performance assertion means a future regression in the FlashInfer path would not be caught in CI.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/modules/mamba/test_flashinfer_chunk_gdn.py` around
lines 1 - 30, Add perf coverage for the FlashInfer GDN prefill path: either add
an entry for the new FlashInfer-prefill case to the appropriate L0 perf test
list (tests/integration/test_lists/test-db/l0_perf.yml or per-GPU l0_*.yml) that
will run with TLLM_USE_FLASHINFER_GDN_PREFILL=1, or add a simple latency
baseline test in tests/integration/defs/perf/test_perf_sanity.py that measures
Qwen3.5 prefill latency for Qwen3NextGatedDeltaNet.forward_extend with
TLLM_USE_FLASHINFER_GDN_PREFILL toggled between 1 and 0; ensure the new perf
test targets the same input shapes exercised by the unit tests so regressions in
the FlashInfer prefill path are caught in CI.
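
A latency baseline of the kind requested here could start from a small timing helper. The sketch below uses only the standard library; `median_latency_ms` and `fake_prefill` are hypothetical names, and the workload is a CPU stand-in. A real harness timing a CUDA kernel would also need to synchronize the device before reading the clock, since kernel launches are asynchronous.

```python
import statistics
import time


def median_latency_ms(fn, warmup=3, iters=10):
    """Run fn a few times untimed, then return the median timed latency in ms."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def fake_prefill():
    """CPU stand-in for a prefill call; not the real kernel."""
    return sum(i * i for i in range(10_000))


baseline_ms = median_latency_ms(fake_prefill)
print(f"median latency: {baseline_ms:.3f} ms")
```

Running the same helper once per backend setting would give two medians to compare; what threshold counts as a regression is left to the perf test owner.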
60-65: 💤 Low value

Fix ruff RUF005: prefer iterable unpacking over list concatenation.
♻️ Suggested fix
     cu = torch.tensor(
-        [0] + list(torch.tensor(seq_lens).cumsum(0).tolist()),
+        [0, *torch.tensor(seq_lens).cumsum(0).tolist()],
         dtype=torch.int64,
         device=device,
     )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/modules/mamba/test_flashinfer_chunk_gdn.py` around
lines 60 - 65, Replace the list concatenation used to build cu with iterable
unpacking: instead of torch.tensor([0] +
list(torch.tensor(seq_lens).cumsum(0).tolist()), ...), construct cu via
torch.tensor([0, *torch.tensor(seq_lens).cumsum(0).tolist()], dtype=torch.int64,
device=device). Update the expression that creates cu (the variable returned
alongside q, k, v, g, beta) to use the iterable unpacking form to satisfy RUF005.
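
For readers unfamiliar with the cumulative-length tensor being built here, the following is a minimal pure-Python sketch of the same construction. `build_cu_seqlens` is a hypothetical helper, and `itertools.accumulate` stands in for the tensor `cumsum` call used in the test file.

```python
from itertools import accumulate


def build_cu_seqlens(seq_lens):
    """Prefix sums with a leading 0, built with iterable unpacking."""
    return [0, *accumulate(seq_lens)]


seq_lens = [3, 5, 2]
cu = build_cu_seqlens(seq_lens)

# Sequence i occupies rows cu[i]:cu[i + 1] of the packed batch.
spans = list(zip(cu[:-1], cu[1:]))

# The concatenation form flagged by RUF005 produces the same list.
legacy = [0] + list(accumulate(seq_lens))
```

Both forms yield identical values, so the suggested change is purely stylistic.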
tensorrt_llm/_torch/modules/fla/flashinfer_chunk.py (2)
35-35: 💤 Low value

Use Python 3.10+ built-in type syntax per coding guidelines.

from typing import Optional, Tuple and Tuple[torch.Tensor, Optional[torch.Tensor]] should use the modern style. The from __future__ import annotations import is also redundant for Python 3.10+.
♻️ Suggested change
-from __future__ import annotations
-
-from typing import Optional, Tuple
-
 import torch
 
 from tensorrt_llm._torch.modules.fla.l2norm import l2norm_fwd
 def chunk_gated_delta_rule(
     q: torch.Tensor,
     k: torch.Tensor,
     v: torch.Tensor,
     g: torch.Tensor,
     beta: torch.Tensor,
-    scale: Optional[float] = None,
-    initial_state: Optional[torch.Tensor] = None,
-    initial_state_indices: Optional[torch.Tensor] = None,
+    scale: float | None = None,
+    initial_state: torch.Tensor | None = None,
+    initial_state_indices: torch.Tensor | None = None,
     inplace_indexed_state_update: bool = False,
     output_final_state: bool = False,
-    cu_seqlens: Optional[torch.Tensor] = None,
+    cu_seqlens: torch.Tensor | None = None,
     head_first: bool = False,
     use_qk_l2norm_in_kernel: bool = False,
-) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
+) -> tuple[torch.Tensor, torch.Tensor | None]:
As per coding guidelines: "Prefer using built-in types list, dict, tuple instead of legacy typing.List, typing.Dict, typing.Tuple; use | syntax instead of typing.Union."

Also applies to: 61-61
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/modules/fla/flashinfer_chunk.py` at line 35, Replace
legacy typing usage and redundant future import: remove "from typing import
Optional, Tuple" (and any "from __future__ import annotations") and update type
annotations that use Tuple[...] and Optional[...] to modern Python 3.10+ syntax,
e.g., change "Tuple[torch.Tensor, Optional[torch.Tensor]]" to
"tuple[torch.Tensor, torch.Tensor | None]" (or use "torch.Tensor | None" for
optional parts). Update all occurrences (e.g., the annotation referenced around
the function/method using that tuple return type) to use built-in "tuple" and
the "|" union operator.
133-134: 💤 Low value

state_buf pre-allocation assumes D_k == D_v.

head_size = q3.shape[2] captures the key head dimension (D_k), but the last two dims of FlashInfer's state are (D_v, D_k). Using head_size for both silently produces a wrong buffer shape when D_k ≠ D_v.

For Qwen3.5 D_k == D_v == 128, so this is currently safe — but it's a silent fragility worth addressing:
♻️ Suggested fix
+v_head_size = v3.shape[2]
 head_size = q3.shape[2]
 num_seqs = cu_seqlens.shape[0] - 1
 output_buf = q3.new_empty(total_seq_len, num_o_heads, v_head_size)
-state_buf = q3.new_empty(num_seqs, num_o_heads, head_size, head_size, dtype=torch.float32)
+state_buf = q3.new_empty(num_seqs, num_o_heads, v_head_size, head_size, dtype=torch.float32)
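
The shape issue raised in this comment can be checked with plain tuples. This is a sketch under the assumption, stated in the comment above, that the last two dimensions of the state are (D_v, D_k); the helper names and the head count of 16 are illustrative only.

```python
def state_shape_vk(num_seqs, num_heads, d_k, d_v):
    """State buffer shape with the last two dims in (D_v, D_k) order."""
    return (num_seqs, num_heads, d_v, d_k)


def state_shape_square(num_seqs, num_heads, head_size):
    """Shape obtained when one head size is reused for both trailing dims."""
    return (num_seqs, num_heads, head_size, head_size)


# With D_k == D_v == 128 the two shapes agree, so the shortcut is harmless.
same = state_shape_vk(2, 16, 128, 128) == state_shape_square(2, 16, 128)

# With D_k != D_v the square shape no longer matches the expected layout.
differs = state_shape_vk(2, 16, 128, 256) != state_shape_square(2, 16, 128)
```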

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5b4de1b8-1c30-4d25-8206-5bf3a3e7ba90

📥 Commits

Reviewing files that changed from the base of the PR and between c20b192 and e6f5624.

📒 Files selected for processing (3)

tensorrt_llm/_torch/modules/fla/flashinfer_chunk.py
tensorrt_llm/_torch/modules/mamba/gdn_mixer.py
tests/unittest/_torch/modules/mamba/test_flashinfer_chunk_gdn.py

tensorrt-cicd · 2026-05-07T11:31:26Z

PR_Github #47115 [ run ] completed with state SUCCESS. Commit: f73c47a
/LLM/main/L0_MergeRequest_PR pipeline #37082 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

nv-guomingz · 2026-05-07T11:50:22Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-07T11:57:07Z

PR_Github #47203 [ run ] triggered by Bot. Commit: f73c47a Link to invocation

tensorrt-cicd · 2026-05-07T15:08:00Z

PR_Github #47203 [ run ] completed with state SUCCESS. Commit: f73c47a
/LLM/main/L0_MergeRequest_PR pipeline #37159 completed with status: 'FAILURE'

nv-guomingz · 2026-05-08T02:26:44Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-08T02:32:50Z

PR_Github #47286 [ run ] triggered by Bot. Commit: f73c47a Link to invocation

tensorrt-cicd · 2026-05-08T04:03:31Z

PR_Github #47286 [ run ] completed with state SUCCESS. Commit: f73c47a
/LLM/main/L0_MergeRequest_PR pipeline #37228 completed with status: 'FAILURE'

nv-guomingz · 2026-05-08T11:37:10Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-08T11:43:13Z

PR_Github #47394 [ run ] triggered by Bot. Commit: f73c47a Link to invocation

tensorrt-cicd · 2026-05-08T12:49:46Z

PR_Github #47394 [ run ] completed with state SUCCESS. Commit: f73c47a
/LLM/main/L0_MergeRequest_PR pipeline #37324 completed with status: 'FAILURE'

nv-guomingz · 2026-05-09T01:27:36Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-09T01:33:01Z

PR_Github #47459 [ run ] triggered by Bot. Commit: 23bf4bf Link to invocation

nv-guomingz · 2026-05-25T04:21:08Z

/bot reuse-pipeline

tensorrt-cicd · 2026-05-25T04:27:35Z

PR_Github #50140 [ reuse-pipeline ] triggered by Bot. Commit: 4514934 Link to invocation

tensorrt-cicd · 2026-05-25T04:32:51Z

PR_Github #50140 [ reuse-pipeline ] completed with state SUCCESS. Commit: 4514934
Reusing PR_Github #50122 for commit 4514934

Link to invocation

HuiGao-NV

LGTM

nv-guomingz · 2026-05-25T08:16:04Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-25T08:21:41Z

PR_Github #50179 [ run ] triggered by Bot. Commit: 3f32586 Link to invocation

Signed-off-by: nv-guomingz <[email protected]>

tensorrt-cicd · 2026-05-25T13:29:46Z

PR_Github #50179 [ run ] completed with state SUCCESS. Commit: 3f32586
/LLM/main/L0_MergeRequest_PR pipeline #39722 completed with status: 'FAILURE'

nv-guomingz · 2026-05-25T13:45:54Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-25T13:51:23Z

PR_Github #50220 [ run ] triggered by Bot. Commit: e0df0be Link to invocation

tensorrt-cicd · 2026-05-25T19:38:02Z

PR_Github #50220 [ run ] completed with state FAILURE. Commit: e0df0be
/LLM/main/L0_MergeRequest_PR pipeline #39757 completed with status: 'FAILURE'

nv-guomingz · 2026-05-25T23:36:43Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-25T23:42:09Z

PR_Github #50239 [ run ] triggered by Bot. Commit: e0df0be Link to invocation

tensorrt-cicd · 2026-05-26T01:03:47Z

PR_Github #50239 [ run ] completed with state SUCCESS. Commit: e0df0be
/LLM/main/L0_MergeRequest_PR pipeline #39775 completed with status: 'SUCCESS'

CI Report

Link to invocation

…VIDIA#13644) Signed-off-by: nv-guomingz <[email protected]>

github-actions Bot assigned nv-guomingz Apr 30, 2026

nv-guomingz force-pushed the user/guomingz/gdn_prefill_fi branch 2 times, most recently from 7fc33eb to e6f5624 Compare May 7, 2026 06:06

nv-guomingz marked this pull request as ready for review May 7, 2026 06:07

nv-guomingz requested review from a team as code owners May 7, 2026 06:07

nv-guomingz requested review from QiJune, symphonylyh and tomeras91 May 7, 2026 06:07

nv-guomingz force-pushed the user/guomingz/gdn_prefill_fi branch from e6f5624 to f73c47a Compare May 7, 2026 06:10

coderabbitai Bot reviewed May 7, 2026

nv-guomingz force-pushed the user/guomingz/gdn_prefill_fi branch from f73c47a to 23bf4bf Compare May 8, 2026 16:27

nv-guomingz force-pushed the user/guomingz/gdn_prefill_fi branch from 23bf4bf to 626e893 Compare May 9, 2026 02:22

nv-guomingz requested a review from a team as a code owner May 9, 2026 02:22

nv-guomingz enabled auto-merge (squash) May 25, 2026 04:21