[TRTLLM-10061][feat] Add FORCE_CHUNK context chunking policy #12483

Merged
VALLIS-NERIA merged 8 commits into NVIDIA:main from VALLIS-NERIA:user/xiweny/force_chunk_policy
Apr 1, 2026

Conversation

@VALLIS-NERIA
Collaborator

@VALLIS-NERIA VALLIS-NERIA commented Mar 24, 2026

Summary

  • Add a new FORCE_CHUNK context chunking policy that forces every context request to be chunked to a fixed unit_size, regardless of whether all requests fit in the batch
  • Needed for hybrid linear (Mamba) models with block reuse enabled, where consistent chunk boundaries are required for prefix cache correctness
  • C++: enum value, ostream support, MicroBatchScheduler template specialization, nanobind binding
  • Python: ChunkingPolicy enum, _chunk_forced scheduler method, ContextChunkingPolicy.FORCE_CHUNK in llm_args

Test plan

  • Existing chunking unit tests pass

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Added a new FORCE_CHUNK context chunking policy option that enforces uniform chunk sizes across context requests, with each request assigned a chunk size equal to the configured unit size (or zero if at token capacity limit, except for the last chunk).
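The rule summarized above can be sketched in a few lines of Python. This is a minimal illustration, not the code from this PR: the function name `force_chunk_sizes` and its list-based interface are hypothetical, and it assumes requests are visited in order and that a request which would push the running total past the token capacity receives a chunk size of zero for the current iteration.

```python
from typing import List, Optional


def force_chunk_sizes(remaining: List[int], unit_size: int,
                      capacity: Optional[int] = None) -> List[int]:
    """Assign one chunk per request: min(unit_size, remaining tokens).

    Hypothetical helper for illustration only. `remaining` holds the number
    of context tokens each request still has to process; `capacity` is the
    token budget for this iteration (None means no limit).
    """
    if unit_size <= 0:
        raise ValueError("unit_size must be positive")
    sizes = []
    used = 0
    for left in remaining:
        # The final chunk of a request may be shorter than unit_size.
        chunk = min(unit_size, left)
        if capacity is not None and used + chunk > capacity:
            # Assumption: the request is skipped until a later iteration.
            chunk = 0
        used += chunk
        sizes.append(chunk)
    return sizes


print(force_chunk_sizes([300, 64, 500], unit_size=128, capacity=256))
# [128, 64, 0]
```

The first request is cut to one unit even though it is the only long one, the second gets its short final chunk, and the third is deferred because the budget is exhausted.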

Add a new FORCE_CHUNK chunking policy that forces every context request
to be chunked to a fixed unit_size. This is needed for hybrid linear
(Mamba) models with block reuse enabled, where consistent chunk
boundaries are required for prefix cache correctness.

Changes span C++ core (enum, scheduler template specialization, nanobind
binding) and Python (scheduler, llm_args config, py_executor_creator
wiring).

Signed-off-by: Xiwen Yu <[email protected]>
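To see why a fixed unit size yields consistent chunk boundaries, consider the token offsets at which each chunk of a request ends. The sketch below is a simplification under stated assumptions: `chunk_boundaries` is a hypothetical helper, every request is assumed to start at offset 0 with no tokens already reused, and exactly one chunk of at most `unit_size` tokens is processed per step.

```python
from typing import List


def chunk_boundaries(context_len: int, unit_size: int) -> List[int]:
    """Return the cumulative token offsets at which each chunk ends."""
    if context_len < 0 or unit_size <= 0:
        raise ValueError("context_len must be >= 0 and unit_size > 0")
    # Full chunks end at multiples of unit_size strictly below context_len.
    ends = list(range(unit_size, context_len, unit_size))
    if context_len > 0:
        ends.append(context_len)  # last chunk ends at the end of the context
    return ends


print(chunk_boundaries(300, 128))  # [128, 256, 300]
print(chunk_boundaries(500, 128))  # [128, 256, 384, 500]
```

Under these assumptions the boundaries depend only on the context length and the unit size, not on what else is in the batch, so two requests sharing a prefix hit the same offsets (128 and 256 here).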
@coderabbitai
Contributor

coderabbitai Bot commented Mar 24, 2026

📝 Walkthrough

Walkthrough

A new kFORCE_CHUNK context chunking policy is introduced across the C++ and Python executor layers. This policy ensures every context request receives a chunk size capped by the unit size and remaining context length, with capacity-based request filtering. Implementation spans enum definitions, scheduler logic, Python bindings, and scheduler integration.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Enum Definition and C++ Core Support: cpp/include/tensorrt_llm/executor/types.h, cpp/tensorrt_llm/executor/types.cpp | Added kFORCE_CHUNK = 2 enum value to ContextChunkingPolicy and extended operator<< to emit "FORCE_CHUNK" for string representation. |
| C++ Scheduler Implementation: cpp/tensorrt_llm/batch_manager/microBatchScheduler.cpp | Implemented kFORCE_CHUNK policy specialization in setCtxRequestsChunkSize that assigns chunk sizes capped by chunkUnitSize and remaining context length, applies capacity budgeting, validates chunkUnitSize <= maxContextLength, and forces re-chunking by setting allContextRequestsFit = false during request selection. |
| Python Bindings: cpp/tensorrt_llm/nanobind/executor/bindings.cpp | Extended Python-exposed ContextChunkingPolicy enum bindings to include FORCE_CHUNK mapped to tle::ContextChunkingPolicy::kFORCE_CHUNK. |
| Python Enum and Scheduler Integration: tensorrt_llm/llmapi/llm_args.py, tensorrt_llm/_torch/pyexecutor/scheduler/scheduler.py | Added ChunkingPolicy.FORCE_CHUNK enum value and wired dispatch logic; scheduler now forces chunking when policy is FORCE_CHUNK regardless of token budget fit; implemented _chunk_forced method that applies per-request chunk size capping with capacity-based zeroing and over-capacity warning emission. |
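The enum addition and the dispatch rule in the table can be pictured with a small stand-alone sketch. It is not the project's code: the class below only mimics the shape of the Python enum, the numeric values 0 and 1 for the two pre-existing policies are an assumption (only kFORCE_CHUNK = 2 is stated here), and `policy_name` and `must_chunk` are hypothetical helpers.

```python
from enum import IntEnum


class ChunkingPolicy(IntEnum):
    # Values 0 and 1 are assumed; only FORCE_CHUNK = 2 is stated in the PR.
    FIRST_COME_FIRST_SERVED = 0
    EQUAL_PROGRESS = 1
    FORCE_CHUNK = 2


def policy_name(policy: ChunkingPolicy) -> str:
    """Emit the bare policy name, similar in spirit to the operator<< change."""
    return policy.name


def must_chunk(policy: ChunkingPolicy, all_requests_fit: bool) -> bool:
    """FORCE_CHUNK chunks unconditionally; other policies only on overflow."""
    return policy is ChunkingPolicy.FORCE_CHUNK or not all_requests_fit


print(policy_name(ChunkingPolicy.FORCE_CHUNK))             # FORCE_CHUNK
print(must_chunk(ChunkingPolicy.FORCE_CHUNK, True))        # True
print(must_chunk(ChunkingPolicy.EQUAL_PROGRESS, True))     # False
```

The point of the second helper is the one the walkthrough makes: with FORCE_CHUNK the "everything fits" shortcut is never taken.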

Sequence Diagram

sequenceDiagram
    participant Req as Incoming Request
    participant Sch as Scheduler
    participant Pol as Chunking Policy
    participant ChunkLogic as Force Chunk Logic

    Req->>Sch: Select context requests
    Sch->>Pol: Check chunking policy
    Pol-->>Sch: kFORCE_CHUNK detected
    Sch->>Sch: Set allContextRequestsFit = false
    Sch->>ChunkLogic: Call _chunk_forced()
    ChunkLogic->>ChunkLogic: For each request: cap chunk by unit_size & remaining_length
    ChunkLogic->>ChunkLogic: Apply capacity budget: zero requests exceeding capacity
    ChunkLogic->>ChunkLogic: Emit warning if total assigned > capacity
    ChunkLogic-->>Sch: Return assigned chunk sizes
    Sch-->>Req: Return re-chunked requests

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 19.05% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: adding a FORCE_CHUNK context chunking policy. |
| Description check | ✅ Passed | The PR description provides a clear summary of the FORCE_CHUNK policy addition, its motivation (hybrid linear models with block reuse), and covers all major implementation changes across C++ and Python layers. |

Signed-off-by: Xiwen Yu <[email protected]>
@VALLIS-NERIA
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #40071 [ run ] triggered by Bot. Commit: 807e9d3 Link to invocation

Comment thread tensorrt_llm/llmapi/llm_args.py
Comment thread tensorrt_llm/_torch/pyexecutor/scheduler/scheduler.py Outdated
Collaborator

@nv-guomingz nv-guomingz left a comment

LGTM

@lancelly
Collaborator

@QiJune Here's a new chunking policy.

@lancelly lancelly requested a review from QiJune March 24, 2026 06:36
@tensorrt-cicd
Collaborator

PR_Github #40071 [ run ] completed with state SUCCESS. Commit: 807e9d3
/LLM/main/L0_MergeRequest_PR pipeline #31225 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@VALLIS-NERIA
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #40115 [ run ] triggered by Bot. Commit: 5f54054 Link to invocation

@VALLIS-NERIA
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #40115 [ run ] completed with state SUCCESS. Commit: 5f54054
/LLM/main/L0_MergeRequest_PR pipeline #31264 completed with status: 'FAILURE'

@tensorrt-cicd
Collaborator

PR_Github #40141 [ run ] triggered by Bot. Commit: 43469ec Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40141 [ run ] completed with state SUCCESS. Commit: 43469ec
/LLM/main/L0_MergeRequest_PR pipeline #31287 completed with status: 'FAILURE'

@VALLIS-NERIA
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #40200 [ run ] triggered by Bot. Commit: 43469ec Link to invocation

@VALLIS-NERIA
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #40200 [ run ] completed with state SUCCESS. Commit: 43469ec
/LLM/main/L0_MergeRequest_PR pipeline #31340 completed with status: 'FAILURE'

@tensorrt-cicd
Collaborator

PR_Github #40299 [ run ] completed with state SUCCESS. Commit: bda1763
/LLM/main/L0_MergeRequest_PR pipeline #31412 completed with status: 'FAILURE'

Comment thread tensorrt_llm/_torch/pyexecutor/scheduler/scheduler.py Outdated
Comment thread tensorrt_llm/_torch/pyexecutor/scheduler/scheduler.py Outdated
Comment thread tensorrt_llm/_torch/pyexecutor/scheduler/scheduler.py Outdated
Comment thread tests/unittest/_torch/executor/test_py_scheduler.py Outdated
- Set all_context_requests_fit=False instead of separate need_chunking
  variable, matching C++ behavior and ensuring correct sort ordering
- Add max_context_length < unit_size validation matching C++
- Replace unreachable warning with assertion
- Remove estimated_reusable_tokens=0 from shared _make_request helper
  (C++ default is already 0)
- Fix docstring: "linear attention" -> "linear attention / Mamba2"

Signed-off-by: Xiwen Yu <[email protected]>
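Two of the review fixes listed above, the unit-size validation and the assertion that replaced the unreachable warning, can be sketched as follows. This is an illustration under assumptions, not the merged code: the helper names are hypothetical, and raising `ValueError` is an assumed way of reporting the invalid configuration.

```python
from typing import List, Optional


def validate_force_chunk(unit_size: int,
                         max_context_length: Optional[int]) -> None:
    """Reject configurations where a single unit cannot fit in one context step."""
    if max_context_length is not None and max_context_length < unit_size:
        # Assumption: the real code may report this differently.
        raise ValueError(
            f"max_context_length ({max_context_length}) must be >= "
            f"unit_size ({unit_size}) for FORCE_CHUNK")


def check_budget(chunk_sizes: List[int], capacity: Optional[int]) -> None:
    """Over-capacity totals should be impossible, so assert instead of warn."""
    if capacity is not None:
        assert sum(chunk_sizes) <= capacity, "assigned chunks exceed capacity"


validate_force_chunk(unit_size=128, max_context_length=1024)  # passes silently
check_budget([128, 64, 0], capacity=256)                       # passes silently
```

An assertion fits here because the capacity check during assignment already zeroes any request that would overflow the budget, so a violation would indicate a bug rather than a runtime condition worth a warning.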
@VALLIS-NERIA
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #40543 [ run ] triggered by Bot. Commit: 957f2bc Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40543 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 9 PM PST on 3/28.

Link to invocation

@VALLIS-NERIA
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #40626 [ run ] triggered by Bot. Commit: 957f2bc Link to invocation

@VALLIS-NERIA VALLIS-NERIA requested a review from a team March 30, 2026 03:44
@tensorrt-cicd
Collaborator

PR_Github #40626 [ run ] completed with state SUCCESS. Commit: 957f2bc
/LLM/main/L0_MergeRequest_PR pipeline #31667 completed with status: 'FAILURE'

@VALLIS-NERIA
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #40696 [ run ] triggered by Bot. Commit: 957f2bc Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40696 [ run ] completed with state SUCCESS. Commit: 957f2bc
/LLM/main/L0_MergeRequest_PR pipeline #31724 completed with status: 'FAILURE'

@VALLIS-NERIA
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #40813 [ run ] triggered by Bot. Commit: 957f2bc Link to invocation

Collaborator

@QiJune QiJune left a comment

LGTM

@VALLIS-NERIA VALLIS-NERIA enabled auto-merge (squash) March 31, 2026 03:03
@tensorrt-cicd
Collaborator

PR_Github #40813 [ run ] completed with state SUCCESS. Commit: 957f2bc
/LLM/main/L0_MergeRequest_PR pipeline #31827 completed with status: 'FAILURE'

@VALLIS-NERIA
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #40885 [ run ] triggered by Bot. Commit: 957f2bc Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40885 [ run ] completed with state SUCCESS. Commit: 957f2bc
/LLM/main/L0_MergeRequest_PR pipeline #31889 completed with status: 'FAILURE'

@VALLIS-NERIA
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #41063 [ run ] triggered by Bot. Commit: 957f2bc Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41063 [ run ] completed with state FAILURE. Commit: 957f2bc
/LLM/main/L0_MergeRequest_PR pipeline #32040 completed with status: 'FAILURE'

@VALLIS-NERIA
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #41121 [ run ] triggered by Bot. Commit: 957f2bc Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41121 [ run ] completed with state SUCCESS. Commit: 957f2bc
/LLM/main/L0_MergeRequest_PR pipeline #32094 completed with status: 'SUCCESS'

@VALLIS-NERIA VALLIS-NERIA merged commit 7a2698b into NVIDIA:main Apr 1, 2026
5 checks passed
karen-sy pushed a commit to karen-sy/TensorRT-LLM that referenced this pull request Apr 7, 2026
@Funatiq
Collaborator

Funatiq commented Apr 9, 2026

Is a single unit_size per request a hard requirement here? If multiples of unit_size per request are also fine, the kEQUAL_PROGRESS policy seems to achieve the same goal.
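The difference this question points at can be made concrete with a toy comparison. The sketch below rests on an assumption that is not verified against the scheduler code: that an equal-progress policy grows every request's chunk round-robin in steps of unit_size until the token budget is spent. Both function names are hypothetical.

```python
from typing import List


def equal_progress_sizes(remaining: List[int], unit_size: int,
                         capacity: int) -> List[int]:
    """Assumed behaviour: grow each chunk by unit_size in turn while budget lasts."""
    sizes = [0] * len(remaining)
    used = 0
    progressed = True
    while progressed:
        progressed = False
        for i, left in enumerate(remaining):
            step = min(unit_size, left - sizes[i])
            if step > 0 and used + step <= capacity:
                sizes[i] += step
                used += step
                progressed = True
    return sizes


def forced_sizes(remaining: List[int], unit_size: int) -> List[int]:
    """FORCE_CHUNK-style assignment, ignoring capacity for brevity."""
    return [min(unit_size, left) for left in remaining]


remaining = [512, 512]
print(equal_progress_sizes(remaining, unit_size=128, capacity=512))  # [256, 256]
print(forced_sizes(remaining, unit_size=128))                        # [128, 128]
```

Under this assumption, equal progress hands out multiples of the unit when the budget allows, while the forced policy never exceeds one unit per request per iteration; whether multiples are acceptable is exactly what the question asks.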
