[None][fix] Fix disaggregated cached token usage accounting#13620
Conversation
Signed-off-by: Shobhit Verma <[email protected]>
ec06753 to
9fc353c
Compare
📝 WalkthroughWalkthroughThe disaggregation service now rewrites generation-phase token usage metrics in both streaming and non-streaming modes using context-phase response data, supporting both context-first and generation-first scheduling strategies with cached token details. Changes
Sequence Diagram(s)sequenceDiagram
participant ContextClient as Context Client
participant GenClient as Generation Client
participant Service as Disagg Service
participant Queue as Output Queue
participant Consumer as Response Consumer
rect rgba(100, 150, 200, 0.5)
Note over Service: Context-First Scheduling
ContextClient->>Service: generate (context phase)
Service->>Service: execute context
Service->>GenClient: forward context result
GenClient->>GenClient: generate (generation phase)
GenClient-->>Service: return gen response + usage
Service->>Service: rewrite usage from ctx response
Service-->>Consumer: wrapped response (ctx-based usage)
end
rect rgba(200, 150, 100, 0.5)
Note over Service: Generation-First Scheduling
GenClient->>Service: stream generation output
Service->>Queue: capture ctx HTTP response
Service->>Queue: queue streamed chunks
Service->>Service: parse SSE data events
Service->>Service: rewrite usage in chunks
Queue-->>Consumer: streaming output (rewritten usage)
GenClient->>Service: final response
Service->>Service: post-process final response
Service-->>Consumer: final response (rewritten usage)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Review rate limit: 9/10 reviews remaining, refill in 6 minutes. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests/unittest/disaggregated/test_openai_disagg_service.py (1)
181-237: QA test-list update is unnecessary for this PR scope.Given this PR adds/updates unit tests only (no
tests/integration/defs/*changes), no immediate QA list file edit is required; a follow-up integration case can be optional if you want scheduled end-to-end guardrails for this exact usage contract.As per coding guidelines "If the PR only touches unittest/ or narrow unit scope, say explicitly whether QA list updates are unnecessary or optional."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/disaggregated/test_openai_disagg_service.py` around lines 181 - 237, The new unit test test_send_disagg_request_rewrites_streaming_usage only touches unittest scope and does not change integration definitions, so update the PR description or commit message to explicitly state that QA list updates are unnecessary (optional integration cases may be added later). Mention the test name test_send_disagg_request_rewrites_streaming_usage and that no tests/integration/defs/* changes were made so reviewers and QA know this is unit-only; no code changes required.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@tests/unittest/disaggregated/test_openai_disagg_service.py`:
- Around line 181-237: The new unit test
test_send_disagg_request_rewrites_streaming_usage only touches unittest scope
and does not change integration definitions, so update the PR description or
commit message to explicitly state that QA list updates are unnecessary
(optional integration cases may be added later). Mention the test name
test_send_disagg_request_rewrites_streaming_usage and that no
tests/integration/defs/* changes were made so reviewers and QA know this is
unit-only; no code changes required.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 7ca05a60-ab81-47f4-a0ac-d68adc8925d7
📒 Files selected for processing (2)
tensorrt_llm/serve/openai_disagg_service.pytests/unittest/disaggregated/test_openai_disagg_service.py
|
/bloom run |
|
/bot run |
|
PR_Github #46223 [ run ] triggered by Bot. Commit: |
|
@reasonsolo @lishicheng1996-nv could you review? |
|
PR_Github #46223 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #46301 [ run ] triggered by Bot. Commit: |
|
PR_Github #46301 [ run ] completed with state |
…3620) Signed-off-by: Shobhit Verma <[email protected]>
Summary
Fix OpenAI-compatible
prompt_tokens_details.cached_tokensaccounting for native disaggregated serving.When a request goes through the context worker first, the generation worker sees prompt KV that was supplied by the context worker. That internal KV-transfer state was being exposed directly as OpenAI
usage.prompt_tokens_details.cached_tokens, so disaggregated responses could report the entire prompt as cached even for a fresh, cache-busted request.This PR rewrites the final client-facing usage at the disaggregated master:
completion_tokensfrom the generation responseprompt_tokensandprompt_tokens_details.cached_tokensfrom the context responseusageeventsReproduction
Start a native TRT-LLM disaggregated endpoint, then send a fresh request with a unique cache-busting system prompt:
Observed on a live disaggregated master:
{ "usage": { "prompt_tokens": 74, "completion_tokens": 16, "total_tokens": 90, "prompt_tokens_details": { "cached_tokens": 74 } } }The request is fresh and cache-busted, so reporting
cached_tokens == prompt_tokensis not a user-facing prefix-cache hit. It is the generation worker reporting prompt KV it received from the context worker.Fix Details
The context worker owns the external prefill/cache accounting. The generation worker owns completion accounting, but its cached-token count includes KV supplied by the context worker in the disaggregated protocol.
For requests that use a context phase, the disaggregated master now normalizes the final usage before returning it:
Streaming responses are handled by rewriting only SSE events that contain a JSON
usageobject; content events and[DONE]pass through unchanged.Tests
Added unit coverage for both
context_firstandgeneration_first:Local checks run:
I could not run the pytest target in this sparse local checkout because the local Python environment does not have
pytestinstalled and importing the sparse TRT-LLM package requires non-checked-out vendored runtime files.Summary by CodeRabbit
Release Notes