Skip to content

[None][fix] Fix disaggregated cached token usage accounting#13620

Merged
reasonsolo merged 1 commit into
NVIDIA:mainfrom
v-shobhit:codex/fix-disagg-cached-token-usage
Apr 30, 2026
Merged

[None][fix] Fix disaggregated cached token usage accounting#13620
reasonsolo merged 1 commit into
NVIDIA:mainfrom
v-shobhit:codex/fix-disagg-cached-token-usage

Conversation

@v-shobhit
Copy link
Copy Markdown
Collaborator

@v-shobhit v-shobhit commented Apr 29, 2026

Summary

Fix OpenAI-compatible prompt_tokens_details.cached_tokens accounting for native disaggregated serving.

When a request goes through the context worker first, the generation worker sees prompt KV that was supplied by the context worker. That internal KV-transfer state was being exposed directly as OpenAI usage.prompt_tokens_details.cached_tokens, so disaggregated responses could report the entire prompt as cached even for a fresh, cache-busted request.

This PR rewrites the final client-facing usage at the disaggregated master:

  • keep completion_tokens from the generation response
  • take prompt_tokens and prompt_tokens_details.cached_tokens from the context response
  • apply the same rewrite to streaming SSE usage events
  • leave generation-only / no-context paths unchanged

Reproduction

Start a native TRT-LLM disaggregated endpoint, then send a fresh request with a unique cache-busting system prompt:

ENDPOINT=http://<host>:<port>

curl -sS --max-time 60 "$ENDPOINT/v1/chat/completions" \
  -H 'Content-Type: application/json' \
  -d "{\"model\":\"openai/gpt-oss-120b\",\"messages\":[{\"role\":\"system\",\"content\":\"You are concise. [rid:manual-$(date +%s)]\"},{\"role\":\"user\",\"content\":\"Say pong.\"}],\"temperature\":0,\"max_tokens\":32,\"stream\":false}"

Observed on a live disaggregated master:

{
  "usage": {
    "prompt_tokens": 74,
    "completion_tokens": 16,
    "total_tokens": 90,
    "prompt_tokens_details": {
      "cached_tokens": 74
    }
  }
}

The request is fresh and cache-busted, so reporting cached_tokens == prompt_tokens is not a user-facing prefix-cache hit. It is the generation worker reporting prompt KV it received from the context worker.

Fix Details

The context worker owns the external prefill/cache accounting. The generation worker owns completion accounting, but its cached-token count includes KV supplied by the context worker in the disaggregated protocol.

For requests that use a context phase, the disaggregated master now normalizes the final usage before returning it:

client prompt_tokens        <- context response usage.prompt_tokens
client cached_tokens        <- context response usage.prompt_tokens_details.cached_tokens
client completion_tokens    <- generation response usage.completion_tokens
client total_tokens         <- prompt_tokens + completion_tokens

Streaming responses are handled by rewriting only SSE events that contain a JSON usage object; content events and [DONE] pass through unchanged.

Tests

Added unit coverage for both context_first and generation_first:

  • non-streaming disagg responses use context-side cached-token accounting
  • streaming SSE final usage chunks are rewritten the same way

Local checks run:

python3 -m py_compile \
  tensorrt_llm/serve/openai_disagg_service.py \
  tests/unittest/disaggregated/test_openai_disagg_service.py

git diff --check

I could not run the pytest target in this sparse local checkout because the local Python environment does not have pytest installed and importing the sparse TRT-LLM package requires non-checked-out vendored runtime files.

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Fixed token usage accounting in disaggregated serving for both streaming and non-streaming modes
    • Improved accuracy of cached token tracking in token usage metrics
    • Enhanced context and generation phase token counting in disaggregated configurations

@v-shobhit v-shobhit marked this pull request as ready for review April 29, 2026 22:29
@v-shobhit v-shobhit requested a review from a team as a code owner April 29, 2026 22:29
@v-shobhit v-shobhit requested a review from zhenhuaw-me April 29, 2026 22:29
@v-shobhit v-shobhit changed the title Fix disaggregated cached token usage accounting [None][fix] Fix disaggregated cached token usage accounting Apr 29, 2026
@v-shobhit v-shobhit force-pushed the codex/fix-disagg-cached-token-usage branch from ec06753 to 9fc353c Compare April 29, 2026 22:32
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 29, 2026

📝 Walkthrough

Walkthrough

The disaggregation service now rewrites generation-phase token usage metrics in both streaming and non-streaming modes using context-phase response data, supporting both context-first and generation-first scheduling strategies with cached token details.

Changes

Cohort / File(s) Summary
Service Token Usage Rewriting
tensorrt_llm/serve/openai_disagg_service.py
Implements token usage rewriting for both scheduling modes: context-first wraps the generation client response to reflect context-phase metrics; generation-first captures and post-processes queued streaming output and final responses. Includes SSE streaming parser that buffers chunks until separators and rewrites valid single-JSON usage payloads while preserving [DONE] and malformed entries.
Test Utilities and Validation
tests/unittest/disaggregated/test_openai_disagg_service.py
Parameterizes _make_completion_response helper to configure UsageInfo fields and PromptTokensDetails for flexible test scenarios. Existing scheduling tests updated to assert disaggregated usage values including cached token counts. New async parameterized test validates streaming mode usage object rewriting by extracting and asserting individual usage fields from streamed chunks.

Sequence Diagram(s)

sequenceDiagram
    participant ContextClient as Context Client
    participant GenClient as Generation Client
    participant Service as Disagg Service
    participant Queue as Output Queue
    participant Consumer as Response Consumer

    rect rgba(100, 150, 200, 0.5)
    Note over Service: Context-First Scheduling
    ContextClient->>Service: generate (context phase)
    Service->>Service: execute context
    Service->>GenClient: forward context result
    GenClient->>GenClient: generate (generation phase)
    GenClient-->>Service: return gen response + usage
    Service->>Service: rewrite usage from ctx response
    Service-->>Consumer: wrapped response (ctx-based usage)
    end

    rect rgba(200, 150, 100, 0.5)
    Note over Service: Generation-First Scheduling
    GenClient->>Service: stream generation output
    Service->>Queue: capture ctx HTTP response
    Service->>Queue: queue streamed chunks
    Service->>Service: parse SSE data events
    Service->>Service: rewrite usage in chunks
    Queue-->>Consumer: streaming output (rewritten usage)
    GenClient->>Service: final response
    Service->>Service: post-process final response
    Service-->>Consumer: final response (rewritten usage)
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: fixing disaggregated cached token usage accounting in OpenAI-compatible serving.
Description check ✅ Passed The PR description includes a clear summary of the problem, fix details, reproduction steps, and test coverage—all required template sections are substantively addressed.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Review rate limit: 9/10 reviews remaining, refill in 6 minutes.

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
tests/unittest/disaggregated/test_openai_disagg_service.py (1)

181-237: QA test-list update is unnecessary for this PR scope.

Given this PR adds/updates unit tests only (no tests/integration/defs/* changes), no immediate QA list file edit is required; a follow-up integration case can be optional if you want scheduled end-to-end guardrails for this exact usage contract.

As per coding guidelines "If the PR only touches unittest/ or narrow unit scope, say explicitly whether QA list updates are unnecessary or optional."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/disaggregated/test_openai_disagg_service.py` around lines 181
- 237, The new unit test test_send_disagg_request_rewrites_streaming_usage only
touches unittest scope and does not change integration definitions, so update
the PR description or commit message to explicitly state that QA list updates
are unnecessary (optional integration cases may be added later). Mention the
test name test_send_disagg_request_rewrites_streaming_usage and that no
tests/integration/defs/* changes were made so reviewers and QA know this is
unit-only; no code changes required.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@tests/unittest/disaggregated/test_openai_disagg_service.py`:
- Around line 181-237: The new unit test
test_send_disagg_request_rewrites_streaming_usage only touches unittest scope
and does not change integration definitions, so update the PR description or
commit message to explicitly state that QA list updates are unnecessary
(optional integration cases may be added later). Mention the test name
test_send_disagg_request_rewrites_streaming_usage and that no
tests/integration/defs/* changes were made so reviewers and QA know this is
unit-only; no code changes required.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7ca05a60-ab81-47f4-a0ac-d68adc8925d7

📥 Commits

Reviewing files that changed from the base of the PR and between 3b7af1c and ec06753.

📒 Files selected for processing (2)
  • tensorrt_llm/serve/openai_disagg_service.py
  • tests/unittest/disaggregated/test_openai_disagg_service.py

@dongfengy
Copy link
Copy Markdown
Collaborator

/bloom run

@dongfengy
Copy link
Copy Markdown
Collaborator

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #46223 [ run ] triggered by Bot. Commit: 9fc353c Link to invocation

@dongfengy
Copy link
Copy Markdown
Collaborator

@reasonsolo @lishicheng1996-nv could you review?

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #46223 [ run ] completed with state SUCCESS. Commit: 9fc353c
/LLM/main/L0_MergeRequest_PR pipeline #36335 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@reasonsolo
Copy link
Copy Markdown
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #46301 [ run ] triggered by Bot. Commit: 9fc353c Link to invocation

@reasonsolo reasonsolo self-assigned this Apr 30, 2026
@reasonsolo reasonsolo enabled auto-merge (squash) April 30, 2026 08:23
@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #46301 [ run ] completed with state SUCCESS. Commit: 9fc353c
/LLM/main/L0_MergeRequest_PR pipeline #36400 completed with status: 'SUCCESS'

CI Report

Link to invocation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants