[TRTLLM-10061][feat] Add FORCE_CHUNK context chunking policy #12483
Add a new FORCE_CHUNK chunking policy that forces every context request to be chunked to a fixed unit_size. This is needed for hybrid linear (Mamba) models with block reuse enabled, where consistent chunk boundaries are required for prefix cache correctness. Changes span C++ core (enum, scheduler template specialization, nanobind binding) and Python (scheduler, llm_args config, py_executor_creator wiring).
Signed-off-by: Xiwen Yu <[email protected]>
…licy Signed-off-by: Xiwen Yu <[email protected]>
Signed-off-by: Xiwen Yu <[email protected]>
Sequence Diagram
sequenceDiagram
participant Req as Incoming Request
participant Sch as Scheduler
participant Pol as Chunking Policy
participant ChunkLogic as Force Chunk Logic
Req->>Sch: Select context requests
Sch->>Pol: Check chunking policy
Pol-->>Sch: kFORCE_CHUNK detected
Sch->>Sch: Set allContextRequestsFit = false
Sch->>ChunkLogic: Call _chunk_forced()
ChunkLogic->>ChunkLogic: For each request: cap chunk by unit_size & remaining_length
ChunkLogic->>ChunkLogic: Apply capacity budget: zero requests exceeding capacity
ChunkLogic->>ChunkLogic: Emit warning if total assigned > capacity
ChunkLogic-->>Sch: Return assigned chunk sizes
Sch-->>Req: Return re-chunked requests
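The chunk-assignment steps in the diagram can be sketched in Python as follows. This is a minimal illustration, not the PR's actual _chunk_forced method: the function name chunk_forced, its parameters, and the choice to keep considering later requests after one has been zeroed are all assumptions made for the example.

```python
from typing import List, Optional


def chunk_forced(remaining_lengths: List[int], unit_size: int,
                 capacity: Optional[int]) -> List[int]:
    """Hypothetical sketch of forced chunking (names are illustrative only).

    Each request gets min(unit_size, remaining_length) tokens. A request
    whose chunk would push the running total past `capacity` gets 0.
    """
    if unit_size <= 0:
        raise ValueError("unit_size must be positive")

    assigned: List[int] = []
    budget_used = 0
    for remaining in remaining_lengths:
        # Cap the chunk by unit_size and by what is left of the request.
        chunk = min(unit_size, remaining)
        # Apply the capacity budget: zero out a request that does not fit.
        if capacity is not None and budget_used + chunk > capacity:
            chunk = 0
        assigned.append(chunk)
        budget_used += chunk

    # By construction the total never exceeds the capacity.
    assert capacity is None or budget_used <= capacity
    return assigned


if __name__ == "__main__":
    # Three requests with 100, 30 and 64 tokens left, unit_size 32, budget 70.
    print(chunk_forced([100, 30, 64], unit_size=32, capacity=70))
```

With a budget of 70 tokens, the first two requests receive 32 and 30 tokens and the third receives 0, because 62 + 32 would exceed the budget.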
Signed-off-by: Xiwen Yu <[email protected]>
@QiJune Here's a new chunking policy.
- Set all_context_requests_fit=False instead of a separate need_chunking variable, matching C++ behavior and ensuring correct sort ordering
- Add max_context_length < unit_size validation matching C++
- Replace unreachable warning with assertion
- Remove estimated_reusable_tokens=0 from shared _make_request helper (C++ default is already 0)
- Fix docstring: "linear attention" -> "linear attention / Mamba2"
Signed-off-by: Xiwen Yu <[email protected]>
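The max_context_length < unit_size validation mentioned in this commit could look roughly like the sketch below. The function name validate_force_chunk_config and the use of ValueError are assumptions for illustration; the commit message does not say how the real check reports the failure.

```python
from typing import Optional


def validate_force_chunk_config(unit_size: int,
                                max_context_length: Optional[int]) -> None:
    """Hypothetical check: a forced chunk must fit in the context limit.

    If max_context_length is set and smaller than unit_size, no request
    could ever receive a full unit-sized chunk, so the configuration is
    rejected up front. (The error type is an assumption.)
    """
    if unit_size <= 0:
        raise ValueError("unit_size must be positive")
    if max_context_length is not None and max_context_length < unit_size:
        raise ValueError(
            f"max_context_length ({max_context_length}) must be >= "
            f"unit_size ({unit_size})")


if __name__ == "__main__":
    validate_force_chunk_config(unit_size=32, max_context_length=64)  # passes
    try:
        validate_force_chunk_config(unit_size=64, max_context_length=32)
    except ValueError as err:
        print(f"rejected: {err}")
```

Checking the configuration once, before any request is scheduled, is what makes it reasonable to replace a later warning with an assertion.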
…12483) Signed-off-by: Xiwen Yu <[email protected]>
Is a single |
Summary
- Add FORCE_CHUNK context chunking policy that forces every context request to be chunked to a fixed unit_size, regardless of whether all requests fit in the batch
- C++: ostream support, MicroBatchScheduler template specialization, nanobind binding
- Python: ChunkingPolicy enum, _chunk_forced scheduler method, ContextChunkingPolicy.FORCE_CHUNK in llm_args
Test plan
🤖 Generated with Claude Code
Summary by CodeRabbit
- Added a FORCE_CHUNK context chunking policy option that enforces uniform chunk sizes across context requests. Each request is assigned a chunk size equal to the configured unit size, except for its last chunk, which is capped by the remaining length; a request is assigned zero when the token capacity limit is reached.
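To illustrate why uniform chunk sizes give consistent boundaries, the sketch below lists the chunk boundaries one request would see over successive scheduler iterations. The helper forced_chunk_boundaries is hypothetical and not part of the PR; it also ignores the capacity limit, which would only delay a chunk rather than move its boundary.

```python
from typing import List, Tuple


def forced_chunk_boundaries(prompt_len: int,
                            unit_size: int) -> List[Tuple[int, int]]:
    """Hypothetical helper: (start, end) token ranges of each forced chunk.

    Every chunk is unit_size long except the last one, which covers
    whatever remains of the prompt.
    """
    if unit_size <= 0:
        raise ValueError("unit_size must be positive")

    boundaries: List[Tuple[int, int]] = []
    start = 0
    while start < prompt_len:
        end = min(start + unit_size, prompt_len)
        boundaries.append((start, end))
        start = end
    return boundaries


if __name__ == "__main__":
    # Two prompts of different lengths share the same leading boundaries.
    print(forced_chunk_boundaries(100, 32))
    print(forced_chunk_boundaries(70, 32))
```

Both prompts are split at tokens 32 and 64, so any prefix they share is chunked at the same positions regardless of the other requests in the batch.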