[TRTLLM-10061][feat] Add FORCE_CHUNK context chunking policy by VALLIS-NERIA · Pull Request #12483 · NVIDIA/TensorRT-LLM

VALLIS-NERIA · 2026-03-24T05:54:06Z

Summary

- Add a new FORCE_CHUNK context chunking policy that forces every context request to be chunked to a fixed unit_size, regardless of whether all requests fit in the batch
- Needed for hybrid linear (Mamba) models with block reuse enabled, where consistent chunk boundaries are required for prefix cache correctness
- C++: enum value, ostream support, MicroBatchScheduler template specialization, nanobind binding
- Python: ChunkingPolicy enum, _chunk_forced scheduler method, ContextChunkingPolicy.FORCE_CHUNK in llm_args
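
To illustrate what "consistent chunk boundaries" could mean in practice, here is a minimal sketch that lists the token offsets at which each forced chunk of a prompt would end. The helper name `chunk_boundaries` is hypothetical and not part of this PR, and the sketch assumes chunking starts at offset 0 (it ignores any reused prefix tokens).

```python
from typing import List


def chunk_boundaries(prompt_len: int, unit_size: int) -> List[int]:
    """Return the cumulative token offsets at which each chunk ends.

    Hypothetical helper: every chunk is exactly unit_size tokens long,
    except the final one, which may be shorter.
    """
    if unit_size <= 0:
        raise ValueError("unit_size must be positive")
    # Full chunks end at multiples of unit_size strictly below prompt_len.
    ends = list(range(unit_size, prompt_len, unit_size))
    if prompt_len > 0:
        ends.append(prompt_len)  # the last chunk ends at the prompt end
    return ends


short = chunk_boundaries(10, 4)  # [4, 8, 10]
long = chunk_boundaries(19, 4)   # [4, 8, 12, 16, 19]

# Inside the part both prompts cover, the boundaries fall on the same offsets.
assert short[:2] == long[:2] == [4, 8]
```

Under this assumption every boundary except the last is a multiple of unit_size, independent of how many other requests share the batch.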

Test plan

Existing chunking unit tests pass

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Added a new FORCE_CHUNK context chunking policy option that enforces uniform chunk sizes across context requests, with each request assigned a chunk size equal to the configured unit size (or zero if at token capacity limit, except for the last chunk).
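
The per-iteration assignment described above can be sketched as follows. This is an illustration only, not the PR's `_chunk_forced` code: the function name `force_chunk_sizes`, its parameters, and the exact rule for zeroing a request (zero it when its chunk would push the running total past the capacity) are assumptions.

```python
from typing import List, Optional


def force_chunk_sizes(
    remaining: List[int],
    unit_size: int,
    capacity: Optional[int] = None,
) -> List[int]:
    """Assign one chunk per context request for a single scheduling step.

    remaining: context tokens still to be processed for each request.
    unit_size: the fixed chunk size the policy enforces.
    capacity:  optional token budget for the step (None means unlimited).
    """
    sizes: List[int] = []
    used = 0
    for rem in remaining:
        # A chunk is one unit, or whatever is left for the last chunk.
        chunk = min(unit_size, rem)
        # Assumed budgeting rule: a request that does not fit gets zero.
        if capacity is not None and used + chunk > capacity:
            chunk = 0
        used += chunk
        sizes.append(chunk)
    return sizes


print(force_chunk_sizes([10, 3, 7], unit_size=4))              # [4, 3, 4]
print(force_chunk_sizes([10, 3, 7], unit_size=4, capacity=8))  # [4, 3, 0]
```

In the second call the third request is zeroed because 4 + 3 + 4 would exceed the budget of 8 tokens; it would be picked up in a later step.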

Add a new FORCE_CHUNK chunking policy that forces every context request to be chunked to a fixed unit_size. This is needed for hybrid linear (Mamba) models with block reuse enabled, where consistent chunk boundaries are required for prefix cache correctness. Changes span C++ core (enum, scheduler template specialization, nanobind binding) and Python (scheduler, llm_args config, py_executor_creator wiring). Signed-off-by: Xiwen Yu <[email protected]>

…licy Signed-off-by: Xiwen Yu <[email protected]>

Signed-off-by: Xiwen Yu <[email protected]>

coderabbitai · 2026-03-24T06:02:04Z

📝 Walkthrough

Walkthrough

A new kFORCE_CHUNK context chunking policy is introduced across the C++ and Python executor layers. This policy ensures every context request receives a chunk size capped by the unit size and remaining context length, with capacity-based request filtering. Implementation spans enum definitions, scheduler logic, Python bindings, and scheduler integration.

Changes

Cohort / File(s)	Summary
Enum Definition and C++ Core Support `cpp/include/tensorrt_llm/executor/types.h`, `cpp/tensorrt_llm/executor/types.cpp`	Added `kFORCE_CHUNK = 2` enum value to `ContextChunkingPolicy` and extended operator<< to emit `"FORCE_CHUNK"` for string representation.
C++ Scheduler Implementation `cpp/tensorrt_llm/batch_manager/microBatchScheduler.cpp`	Implemented `kFORCE_CHUNK` policy specialization in `setCtxRequestsChunkSize` that assigns chunk sizes capped by `chunkUnitSize` and remaining context length, applies capacity budgeting, validates `chunkUnitSize <= maxContextLength`, and forces re-chunking by setting `allContextRequestsFit = false` during request selection.
Python Bindings `cpp/tensorrt_llm/nanobind/executor/bindings.cpp`	Extended Python-exposed `ContextChunkingPolicy` enum bindings to include `FORCE_CHUNK` mapped to `tle::ContextChunkingPolicy::kFORCE_CHUNK`.
Python Enum and Scheduler Integration `tensorrt_llm/llmapi/llm_args.py`, `tensorrt_llm/_torch/pyexecutor/scheduler/scheduler.py`	Added `ChunkingPolicy.FORCE_CHUNK` enum value and wired dispatch logic; scheduler now forces chunking when policy is `FORCE_CHUNK` regardless of token budget fit; implemented `_chunk_forced` method that applies per-request chunk size capping with capacity-based zeroing and over-capacity warning emission.
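
A rough sketch of the dispatch and validation behaviour summarized in the table, written as standalone Python. The enum below only mirrors the names; `kFORCE_CHUNK = 2` comes from the walkthrough, while the values of the other two members and the helper names `validate_unit_size` and `must_rechunk` are assumptions made for illustration.

```python
from enum import IntEnum
from typing import Optional


class ChunkingPolicy(IntEnum):
    # Values 0 and 1 are assumed; only FORCE_CHUNK = 2 is stated in the PR.
    FIRST_COME_FIRST_SERVED = 0
    EQUAL_PROGRESS = 1
    FORCE_CHUNK = 2


def validate_unit_size(unit_size: int, max_context_length: Optional[int]) -> None:
    """Reject configurations where one unit cannot fit in the context limit."""
    if max_context_length is not None and max_context_length < unit_size:
        raise ValueError(
            f"unit_size ({unit_size}) exceeds max_context_length "
            f"({max_context_length})"
        )


def must_rechunk(policy: ChunkingPolicy, all_context_requests_fit: bool) -> bool:
    """Decide whether the chunking pass has to run for this step."""
    if policy is ChunkingPolicy.FORCE_CHUNK:
        # FORCE_CHUNK behaves as if the requests never fit, so the
        # chunking pass runs even when the token budget is sufficient.
        all_context_requests_fit = False
    return not all_context_requests_fit


validate_unit_size(unit_size=4, max_context_length=8)   # accepted
print(must_rechunk(ChunkingPolicy.FORCE_CHUNK, True))    # True
print(must_rechunk(ChunkingPolicy.EQUAL_PROGRESS, True)) # False
```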

Sequence Diagram

sequenceDiagram
    participant Req as Incoming Request
    participant Sch as Scheduler
    participant Pol as Chunking Policy
    participant ChunkLogic as Force Chunk Logic

    Req->>Sch: Select context requests
    Sch->>Pol: Check chunking policy
    Pol-->>Sch: kFORCE_CHUNK detected
    Sch->>Sch: Set allContextRequestsFit = false
    Sch->>ChunkLogic: Call _chunk_forced()
    ChunkLogic->>ChunkLogic: For each request: cap chunk by unit_size & remaining_length
    ChunkLogic->>ChunkLogic: Apply capacity budget: zero requests exceeding capacity
    ChunkLogic->>ChunkLogic: Emit warning if total assigned > capacity
    ChunkLogic-->>Sch: Return assigned chunk sizes
    Sch-->>Req: Return re-chunked requests

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 19.05% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically summarizes the main change: adding a FORCE_CHUNK context chunking policy.
Description check	✅ Passed	The PR description provides a clear summary of the FORCE_CHUNK policy addition, its motivation (hybrid linear models with block reuse), and covers all major implementation changes across C++ and Python layers.

Signed-off-by: Xiwen Yu <[email protected]>

VALLIS-NERIA · 2026-03-24T06:16:08Z

/bot run

tensorrt-cicd · 2026-03-24T06:21:40Z

PR_Github #40071 [ run ] triggered by Bot. Commit: 807e9d3 Link to invocation

nv-guomingz

LGTM

lancelly · 2026-03-24T06:36:43Z

@QiJune Here's a new chunking policy.

tensorrt-cicd · 2026-03-24T11:50:54Z

PR_Github #40071 [ run ] completed with state SUCCESS. Commit: 807e9d3
/LLM/main/L0_MergeRequest_PR pipeline #31225 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

VALLIS-NERIA · 2026-03-24T12:02:48Z

/bot run

tensorrt-cicd · 2026-03-24T12:08:37Z

PR_Github #40115 [ run ] triggered by Bot. Commit: 5f54054 Link to invocation

VALLIS-NERIA · 2026-03-24T15:05:05Z

/bot run

tensorrt-cicd · 2026-03-24T15:07:47Z

PR_Github #40115 [ run ] completed with state SUCCESS. Commit: 5f54054
/LLM/main/L0_MergeRequest_PR pipeline #31264 completed with status: 'FAILURE'

tensorrt-cicd · 2026-03-24T15:11:04Z

PR_Github #40141 [ run ] triggered by Bot. Commit: 43469ec Link to invocation

tensorrt-cicd · 2026-03-24T17:57:34Z

PR_Github #40141 [ run ] completed with state SUCCESS. Commit: 43469ec
/LLM/main/L0_MergeRequest_PR pipeline #31287 completed with status: 'FAILURE'

VALLIS-NERIA · 2026-03-25T01:12:05Z

/bot run

tensorrt-cicd · 2026-03-25T01:17:52Z

PR_Github #40200 [ run ] triggered by Bot. Commit: 43469ec Link to invocation

VALLIS-NERIA · 2026-03-25T03:00:25Z

/bot run

tensorrt-cicd · 2026-03-25T03:00:57Z

PR_Github #40200 [ run ] completed with state SUCCESS. Commit: 43469ec
/LLM/main/L0_MergeRequest_PR pipeline #31340 completed with status: 'FAILURE'

tensorrt-cicd · 2026-03-25T16:00:54Z

PR_Github #40299 [ run ] completed with state SUCCESS. Commit: bda1763
/LLM/main/L0_MergeRequest_PR pipeline #31412 completed with status: 'FAILURE'

- Set all_context_requests_fit=False instead of separate need_chunking variable, matching C++ behavior and ensuring correct sort ordering
- Add max_context_length < unit_size validation matching C++
- Replace unreachable warning with assertion
- Remove estimated_reusable_tokens=0 from shared _make_request helper (C++ default is already 0)
- Fix docstring: "linear attention" -> "linear attention / Mamba2"

Signed-off-by: Xiwen Yu <[email protected]>

VALLIS-NERIA · 2026-03-28T14:27:29Z

/bot run

tensorrt-cicd · 2026-03-28T14:33:29Z

PR_Github #40543 [ run ] triggered by Bot. Commit: 957f2bc Link to invocation

tensorrt-cicd · 2026-03-28T14:33:29Z

PR_Github #40543 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 9 PM PST on 3/28.

Link to invocation

VALLIS-NERIA · 2026-03-30T02:14:07Z

/bot run

tensorrt-cicd · 2026-03-30T02:19:45Z

PR_Github #40626 [ run ] triggered by Bot. Commit: 957f2bc Link to invocation

tensorrt-cicd · 2026-03-30T07:28:04Z

PR_Github #40626 [ run ] completed with state SUCCESS. Commit: 957f2bc
/LLM/main/L0_MergeRequest_PR pipeline #31667 completed with status: 'FAILURE'

VALLIS-NERIA · 2026-03-30T08:09:48Z

/bot run

tensorrt-cicd · 2026-03-30T08:15:14Z

PR_Github #40696 [ run ] triggered by Bot. Commit: 957f2bc Link to invocation

tensorrt-cicd · 2026-03-30T14:02:33Z

PR_Github #40696 [ run ] completed with state SUCCESS. Commit: 957f2bc
/LLM/main/L0_MergeRequest_PR pipeline #31724 completed with status: 'FAILURE'

VALLIS-NERIA · 2026-03-31T01:27:25Z

/bot run

tensorrt-cicd · 2026-03-31T01:33:01Z

PR_Github #40813 [ run ] triggered by Bot. Commit: 957f2bc Link to invocation

QiJune

LGTM

tensorrt-cicd · 2026-03-31T05:54:34Z

PR_Github #40813 [ run ] completed with state SUCCESS. Commit: 957f2bc
/LLM/main/L0_MergeRequest_PR pipeline #31827 completed with status: 'FAILURE'

VALLIS-NERIA · 2026-03-31T06:41:01Z

/bot run

tensorrt-cicd · 2026-03-31T06:46:41Z

PR_Github #40885 [ run ] triggered by Bot. Commit: 957f2bc Link to invocation

tensorrt-cicd · 2026-03-31T11:29:24Z

PR_Github #40885 [ run ] completed with state SUCCESS. Commit: 957f2bc
/LLM/main/L0_MergeRequest_PR pipeline #31889 completed with status: 'FAILURE'

VALLIS-NERIA · 2026-04-01T01:42:20Z

/bot run

tensorrt-cicd · 2026-04-01T01:48:38Z

PR_Github #41063 [ run ] triggered by Bot. Commit: 957f2bc Link to invocation

tensorrt-cicd · 2026-04-01T03:39:41Z

PR_Github #41063 [ run ] completed with state FAILURE. Commit: 957f2bc
/LLM/main/L0_MergeRequest_PR pipeline #32040 completed with status: 'FAILURE'

VALLIS-NERIA · 2026-04-01T05:38:16Z

/bot run

tensorrt-cicd · 2026-04-01T05:45:22Z

PR_Github #41121 [ run ] triggered by Bot. Commit: 957f2bc Link to invocation

tensorrt-cicd · 2026-04-01T10:11:04Z

PR_Github #41121 [ run ] completed with state SUCCESS. Commit: 957f2bc
/LLM/main/L0_MergeRequest_PR pipeline #32094 completed with status: 'SUCCESS'

CI Report

Link to invocation

…12483) Signed-off-by: Xiwen Yu <[email protected]>

Funatiq · 2026-04-09T14:44:45Z

Is a single unit_size per request a hard requirement here? If multiples of unit_size per request are also fine, the kEQUAL_PROGRESS policy seems to achieve the same goal.
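
For readers comparing the two policies raised in this question, the following sketch contrasts a simplified equal-progress style assignment with a single-unit assignment. Both functions are hypothetical models written for this illustration, not the library's implementations, and the budgeting rules are assumptions.

```python
from typing import List


def equal_progress_sizes(remaining: List[int], unit_size: int, capacity: int) -> List[int]:
    """Simplified model: grow every request by one unit per round until
    the budget is exhausted or every context is fully covered."""
    sizes = [0] * len(remaining)
    used = 0
    progressed = True
    while progressed:
        progressed = False
        for i, rem in enumerate(remaining):
            step = min(unit_size, rem - sizes[i])
            if step <= 0 or used + step > capacity:
                continue  # nothing left for this request, or no budget
            sizes[i] += step
            used += step
            progressed = True
    return sizes


def single_unit_sizes(remaining: List[int], unit_size: int, capacity: int) -> List[int]:
    """Simplified model of a forced policy: at most one unit per request."""
    sizes = []
    used = 0
    for rem in remaining:
        chunk = min(unit_size, rem)
        if used + chunk > capacity:
            chunk = 0
        used += chunk
        sizes.append(chunk)
    return sizes


print(equal_progress_sizes([10, 10], unit_size=4, capacity=16))  # [8, 8]
print(single_unit_sizes([10, 10], unit_size=4, capacity=16))     # [4, 4]
```

In this model the equal-progress assignment hands out multiples of unit_size when the budget allows, whereas the single-unit assignment never exceeds one unit per request per step.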

VALLIS-NERIA requested review from a team as code owners March 24, 2026 05:54

VALLIS-NERIA requested a review from schetlur-nv March 24, 2026 05:54

github-actions Bot assigned VALLIS-NERIA Mar 24, 2026

VALLIS-NERIA added 2 commits March 24, 2026 13:56

Merge remote-tracking branch 'origin' into user/xiweny/force_chunk_po…

730d6ea

…licy Signed-off-by: Xiwen Yu <[email protected]>

add tests for scheduler

6ed49f3

Signed-off-by: Xiwen Yu <[email protected]>

improve comments

807e9d3

Signed-off-by: Xiwen Yu <[email protected]>

nv-guomingz reviewed Mar 24, 2026

Comment thread tensorrt_llm/llmapi/llm_args.py

nv-guomingz reviewed Mar 24, 2026

Comment thread tensorrt_llm/_torch/pyexecutor/scheduler/scheduler.py Outdated

nv-guomingz approved these changes Mar 24, 2026

lancelly requested a review from QiJune March 24, 2026 06:36

Merge branch 'main' into user/xiweny/force_chunk_policy

5f54054

Merge branch 'main' into user/xiweny/force_chunk_policy

43469ec

Merge branch 'main' into user/xiweny/force_chunk_policy

bda1763

lancelly reviewed Mar 26, 2026

VALLIS-NERIA requested a review from a team March 30, 2026 03:44

QiJune approved these changes Mar 31, 2026

VALLIS-NERIA enabled auto-merge (squash) March 31, 2026 03:03

VALLIS-NERIA merged commit 7a2698b into NVIDIA:main Apr 1, 2026
5 checks passed

karen-sy pushed a commit to karen-sy/TensorRT-LLM that referenced this pull request Apr 7, 2026

[TRTLLM-10061][feat] Add FORCE_CHUNK context chunking policy (NVIDIA#…

51166fe

…12483) Signed-off-by: Xiwen Yu <[email protected]>
