triage routing: context size metadata biases complexity classification for simple queries in long conversations #2228
Description
Summary
`build_triage_prompt` includes "`{msg_count} messages, ~{token_estimate} tokens`" as conversation context in the triage classification prompt. In long conversations (50+ messages of history), this causes simple queries to be misclassified as complex or expert, because the LLM infers that the ongoing conversation must be complex.
Root Cause
```rust
// crates/zeph-llm/src/router/triage.rs:350
format!(
    r#"...Conversation context: {msg_count} messages, ~{token_estimate} tokens\n\nUser message:\n{truncated}..."#
)
```

The triage model (gpt-4o-mini) sees "51 messages, ~15000 tokens" alongside a simple query like "Solve: 3+4" and may upgrade its tier estimate because of the implied ongoing complexity.
Reproduction
- Start a session with a long history (50+ messages, e.g. the persistent `testing.toml` conversation)
- Enable `routing = "triage"` with two providers
- Send: `Solve: 3+4`
- Observe: the initial classification is `tier="simple"`, but the same session's follow-up calls return `tier="expert"` or `tier="medium"`
Expected
Classification should be based primarily on the content of the current user message, not on the accumulated conversation size. A simple arithmetic query should reliably return `tier="simple"` regardless of conversation length.
Actual
```
triage routing: chat_with_tools tier="simple"   ← initial tool call
triage routing: chat tier="expert"              ← follow-up with long history context
triage routing: chat tier="medium"
```
Impact
- MEDIUM: wrong tier → wrong provider selected → cost/quality mismatch
- Particularly affects long sessions where all queries eventually get escalated
- Makes cost optimization unreliable for sustained use
Fix Direction
Option A: Remove msg_count/token_estimate from the triage prompt — classify only the last user message content.
Option B: Replace absolute counts with bucketed labels: short/medium/long context to reduce noise.
Option C: Add a large_context_threshold where the classification prompt skips context size for small queries.
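Option B could be sketched roughly as follows. The function name and threshold values are illustrative assumptions, not from the codebase; the point is only that the triage model sees a coarse label instead of raw numbers it can anchor on:

```rust
// Hypothetical sketch of Option B: map absolute counts to coarse bucket
// labels ("short"/"medium"/"long") so the triage model cannot anchor on
// large raw numbers. Thresholds here are illustrative assumptions.
fn context_bucket(msg_count: usize, token_estimate: usize) -> &'static str {
    if msg_count > 30 || token_estimate > 8_000 {
        "long"
    } else if msg_count > 10 || token_estimate > 2_000 {
        "medium"
    } else {
        "short"
    }
}

fn main() {
    // The repro case (51 messages, ~15000 tokens) collapses to just "long",
    // while a fresh session stays "short".
    println!("{}", context_bucket(51, 15_000));
    println!("{}", context_bucket(3, 500));
}
```

The prompt would then say e.g. "Conversation context: long" instead of "51 messages, ~15000 tokens", which removes the numeric signal while still letting Option C-style threshold logic reuse the same bucketing.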
Discovered in CI-210 (2026-03-27) during live triage routing verification.