[None][fix] Fix disaggregated cached token usage accounting by v-shobhit · Pull Request #13620 · NVIDIA/TensorRT-LLM

v-shobhit · 2026-04-29T22:17:38Z

Summary

Fix OpenAI-compatible prompt_tokens_details.cached_tokens accounting for native disaggregated serving.

When a request goes through the context worker first, the generation worker sees prompt KV that was supplied by the context worker. That internal KV-transfer state was being exposed directly as OpenAI usage.prompt_tokens_details.cached_tokens, so disaggregated responses could report the entire prompt as cached even for a fresh, cache-busted request.

This PR rewrites the final client-facing usage at the disaggregated master:

keep completion_tokens from the generation response
take prompt_tokens and prompt_tokens_details.cached_tokens from the context response
apply the same rewrite to streaming SSE usage events
leave generation-only / no-context paths unchanged

Reproduction

Start a native TRT-LLM disaggregated endpoint, then send a fresh request with a unique cache-busting system prompt:

ENDPOINT=http://<host>:<port>

curl -sS --max-time 60 "$ENDPOINT/v1/chat/completions" \
  -H 'Content-Type: application/json' \
  -d "{\"model\":\"openai/gpt-oss-120b\",\"messages\":[{\"role\":\"system\",\"content\":\"You are concise. [rid:manual-$(date +%s)]\"},{\"role\":\"user\",\"content\":\"Say pong.\"}],\"temperature\":0,\"max_tokens\":32,\"stream\":false}"

Observed on a live disaggregated master:

{
  "usage": {
    "prompt_tokens": 74,
    "completion_tokens": 16,
    "total_tokens": 90,
    "prompt_tokens_details": {
      "cached_tokens": 74
    }
  }
}

The request is fresh and cache-busted, so reporting cached_tokens == prompt_tokens is not a user-facing prefix-cache hit. It is the generation worker reporting prompt KV it received from the context worker.

Fix Details

The context worker owns the external prefill/cache accounting. The generation worker owns completion accounting, but its cached-token count includes KV supplied by the context worker in the disaggregated protocol.

For requests that use a context phase, the disaggregated master now normalizes the final usage before returning it:

client prompt_tokens        <- context response usage.prompt_tokens
client cached_tokens        <- context response usage.prompt_tokens_details.cached_tokens
client completion_tokens    <- generation response usage.completion_tokens
client total_tokens         <- prompt_tokens + completion_tokens

Streaming responses are handled by rewriting only SSE events that contain a JSON usage object; content events and [DONE] pass through unchanged.

Tests

Added unit coverage for both context_first and generation_first:

non-streaming disagg responses use context-side cached-token accounting
streaming SSE final usage chunks are rewritten the same way

Local checks run:

python3 -m py_compile \
  tensorrt_llm/serve/openai_disagg_service.py \
  tests/unittest/disaggregated/test_openai_disagg_service.py

git diff --check

I could not run the pytest target in this sparse local checkout because the local Python environment does not have pytest installed and importing the sparse TRT-LLM package requires non-checked-out vendored runtime files.

Summary by CodeRabbit

Release Notes

Bug Fixes
- Fixed token usage accounting in disaggregated serving for both streaming and non-streaming modes
- Improved accuracy of cached token tracking in token usage metrics
- Enhanced context and generation phase token counting in disaggregated configurations

Signed-off-by: Shobhit Verma <[email protected]>

coderabbitai · 2026-04-29T22:33:21Z

📝 Walkthrough

Walkthrough

The disaggregation service now rewrites generation-phase token usage metrics in both streaming and non-streaming modes using context-phase response data, supporting both context-first and generation-first scheduling strategies with cached token details.

Changes

Cohort / File(s)	Summary
Service Token Usage Rewriting `tensorrt_llm/serve/openai_disagg_service.py`	Implements token usage rewriting for both scheduling modes: context-first wraps the generation client response to reflect context-phase metrics; generation-first captures and post-processes queued streaming output and final responses. Includes SSE streaming parser that buffers chunks until separators and rewrites valid single-JSON `usage` payloads while preserving `[DONE]` and malformed entries.
Test Utilities and Validation `tests/unittest/disaggregated/test_openai_disagg_service.py`	Parameterizes `_make_completion_response` helper to configure `UsageInfo` fields and `PromptTokensDetails` for flexible test scenarios. Existing scheduling tests updated to assert disaggregated usage values including cached token counts. New async parameterized test validates streaming mode `usage` object rewriting by extracting and asserting individual usage fields from streamed chunks.

Sequence Diagram(s)

sequenceDiagram
    participant ContextClient as Context Client
    participant GenClient as Generation Client
    participant Service as Disagg Service
    participant Queue as Output Queue
    participant Consumer as Response Consumer

    rect rgba(100, 150, 200, 0.5)
    Note over Service: Context-First Scheduling
    ContextClient->>Service: generate (context phase)
    Service->>Service: execute context
    Service->>GenClient: forward context result
    GenClient->>GenClient: generate (generation phase)
    GenClient-->>Service: return gen response + usage
    Service->>Service: rewrite usage from ctx response
    Service-->>Consumer: wrapped response (ctx-based usage)
    end

    rect rgba(200, 150, 100, 0.5)
    Note over Service: Generation-First Scheduling
    GenClient->>Service: stream generation output
    Service->>Queue: capture ctx HTTP response
    Service->>Queue: queue streamed chunks
    Service->>Service: parse SSE data events
    Service->>Service: rewrite usage in chunks
    Queue-->>Consumer: streaming output (rewritten usage)
    GenClient->>Service: final response
    Service->>Service: post-process final response
    Service-->>Consumer: final response (rewritten usage)
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: fixing disaggregated cached token usage accounting in OpenAI-compatible serving.
Description check	✅ Passed	The PR description includes a clear summary of the problem, fix details, reproduction steps, and test coverage—all required template sections are substantively addressed.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Review rate limit: 9/10 reviews remaining, refill in 6 minutes.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

tests/unittest/disaggregated/test_openai_disagg_service.py (1)
181-237: QA test-list update is unnecessary for this PR scope.

Given this PR adds/updates unit tests only (no tests/integration/defs/* changes), no immediate QA list file edit is required; a follow-up integration case can be optional if you want scheduled end-to-end guardrails for this exact usage contract.

As per coding guidelines "If the PR only touches unittest/ or narrow unit scope, say explicitly whether QA list updates are unnecessary or optional."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/disaggregated/test_openai_disagg_service.py` around lines 181
- 237, The new unit test test_send_disagg_request_rewrites_streaming_usage only
touches unittest scope and does not change integration definitions, so update
the PR description or commit message to explicitly state that QA list updates
are unnecessary (optional integration cases may be added later). Mention the
test name test_send_disagg_request_rewrites_streaming_usage and that no
tests/integration/defs/* changes were made so reviewers and QA know this is
unit-only; no code changes required.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@tests/unittest/disaggregated/test_openai_disagg_service.py`:
- Around line 181-237: The new unit test
test_send_disagg_request_rewrites_streaming_usage only touches unittest scope
and does not change integration definitions, so update the PR description or
commit message to explicitly state that QA list updates are unnecessary
(optional integration cases may be added later). Mention the test name
test_send_disagg_request_rewrites_streaming_usage and that no
tests/integration/defs/* changes were made so reviewers and QA know this is
unit-only; no code changes required.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7ca05a60-ab81-47f4-a0ac-d68adc8925d7

📥 Commits

Reviewing files that changed from the base of the PR and between 3b7af1c and ec06753.

📒 Files selected for processing (2)

tensorrt_llm/serve/openai_disagg_service.py
tests/unittest/disaggregated/test_openai_disagg_service.py

dongfengy · 2026-04-29T23:12:46Z

/bloom run

dongfengy · 2026-04-29T23:14:27Z

/bot run

tensorrt-cicd · 2026-04-29T23:22:02Z

PR_Github #46223 [ run ] triggered by Bot. Commit: 9fc353c Link to invocation

dongfengy · 2026-04-30T00:07:06Z

@reasonsolo @lishicheng1996-nv could you review?

tensorrt-cicd · 2026-04-30T04:42:09Z

PR_Github #46223 [ run ] completed with state SUCCESS. Commit: 9fc353c
/LLM/main/L0_MergeRequest_PR pipeline #36335 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

reasonsolo · 2026-04-30T05:16:32Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-30T05:22:22Z

PR_Github #46301 [ run ] triggered by Bot. Commit: 9fc353c Link to invocation

tensorrt-cicd · 2026-04-30T14:46:59Z

PR_Github #46301 [ run ] completed with state SUCCESS. Commit: 9fc353c
/LLM/main/L0_MergeRequest_PR pipeline #36400 completed with status: 'SUCCESS'

CI Report

Link to invocation

…3620) Signed-off-by: Shobhit Verma <[email protected]>

github-actions Bot assigned v-shobhit Apr 29, 2026

v-shobhit marked this pull request as ready for review April 29, 2026 22:29

v-shobhit requested a review from a team as a code owner April 29, 2026 22:29

v-shobhit requested a review from zhenhuaw-me April 29, 2026 22:29

v-shobhit changed the title ~~Fix disaggregated cached token usage accounting~~ [None][fix] Fix disaggregated cached token usage accounting Apr 29, 2026

Fix disaggregated cached token usage

9fc353c

Signed-off-by: Shobhit Verma <[email protected]>

v-shobhit force-pushed the codex/fix-disagg-cached-token-usage branch from ec06753 to 9fc353c Compare April 29, 2026 22:32

coderabbitai Bot reviewed Apr 29, 2026

View reviewed changes

dongfengy requested a review from reasonsolo April 29, 2026 22:57

v-shobhit mentioned this pull request Apr 29, 2026

[None][fix] Fix disaggregated cached token usage (feat/bench_x) #13623

Closed

dongfengy requested a review from lishicheng1996-nv April 30, 2026 00:07

reasonsolo approved these changes Apr 30, 2026

View reviewed changes

reasonsolo self-assigned this Apr 30, 2026

reasonsolo enabled auto-merge (squash) April 30, 2026 08:23

pcastonguay approved these changes Apr 30, 2026

View reviewed changes

reasonsolo merged commit 6f8adb2 into NVIDIA:main Apr 30, 2026
7 checks passed

This was referenced May 7, 2026

[Bug]: KV cache block evicted from reuse tree before async CacheSender completes transfer, causing silent deadlock in disaggregated serving #12542

Closed

[Bug]: Disaggregated Serving: Sequence Lifecycle Race Condition in PyExecutor #12540

Open

yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request May 19, 2026

[None][fix] Fix disaggregated cached token usage accounting (NVIDIA#1…

8692f28

…3620) Signed-off-by: Shobhit Verma <[email protected]>

Conversation

v-shobhit commented Apr 29, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Reproduction

Fix Details

Tests

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented Apr 29, 2026

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

dongfengy commented Apr 29, 2026

Uh oh!

dongfengy commented Apr 29, 2026

Uh oh!

tensorrt-cicd commented Apr 29, 2026

Uh oh!

dongfengy commented Apr 30, 2026

Uh oh!

tensorrt-cicd commented Apr 30, 2026

Uh oh!

reasonsolo commented Apr 30, 2026

Uh oh!

tensorrt-cicd commented Apr 30, 2026

Uh oh!

tensorrt-cicd commented Apr 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

v-shobhit commented Apr 29, 2026 •

edited by coderabbitai Bot

Loading