fix(orchestrator): chat_with_tools lacks fallback chain, fails immediately on default provider error #1941
Summary
ModelOrchestrator::chat_with_tools (LlmProvider impl, crates/zeph-llm/src/orchestrator/mod.rs ~L466) delegates directly to default_provider with no fallback loop. When the default provider fails (e.g. HTTP 400, quota exhausted), the error propagates immediately to the agent — unlike chat() and stream() which go through chat_with_fallback() / stream_with_fallback() and try the next provider in the route chain.
Reproduction
Config: ~/.config/zeph/cloud.toml with [llm.orchestrator] and providers claude (default) + ollama (fallback).
- Claude API returns 400 (credit balance too low).
- chat() calls succeed via fallback; the TUI shows the provider switched to ollama. ✓
- Agent enters the native tool_use path → calls chat_with_tools → goes directly to claude again → 400 → Response processing failed. ✗
Log evidence
WARN zeph_llm::orchestrator: provider claude failed: Claude API request failed (status 400), trying next
# ... fallback to ollama works for chat() ...
DEBUG zeph_core::agent::tool_execution::legacy: using native tool_use path provider="orchestrator"
DEBUG llm_call{model=claude-sonnet-4-6}: zeph_llm::orchestrator: orchestrator delegating chat_with_tools default_provider=claude
ERROR llm_call{model=claude-sonnet-4-6}: zeph_llm::claude: Claude API error 400 Bad Request: credit balance too low
ERROR zeph_core::agent: Response processing failed: Claude API request failed (status 400 Bad Request)
Root cause
crates/zeph-llm/src/orchestrator/mod.rs, LlmProvider::chat_with_tools impl:
// Current (no fallback):
async fn chat_with_tools(...) {
let provider = self.providers.get(&self.default_provider)...;
let response = provider.chat_with_tools(messages, tools).await?; // <-- no fallback
...
}
There is a route_chat_with_tools() helper (L306-344) that falls back to the default provider when a named provider fails, but the trait impl itself has no multi-provider fallback at all.
Expected behavior
chat_with_tools should iterate the same fallback chain as chat_with_fallback() — try each provider in the route, skip unavailable ones, and only fail if all are exhausted. The "last used provider" should be updated to reflect the actual provider that succeeded.
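The expected behavior could look roughly like the sketch below. It is a simplified, synchronous model and not the actual zeph-llm code: ToolProvider, LlmError, Orchestrator, the chain and last_used fields, and the Failing and Echo stubs are all hypothetical names introduced for illustration. The real trait method is async and takes messages and tools rather than a single prompt string.

```rust
use std::collections::HashMap;

/// Hypothetical error type; the real crate has its own.
#[derive(Debug, Clone, PartialEq)]
enum LlmError {
    /// Provider answered with an HTTP error status.
    Api(u16),
    /// No provider in the chain could be tried.
    AllProvidersFailed,
}

/// Hypothetical, synchronous stand-in for the tool-calling part of LlmProvider.
trait ToolProvider {
    fn chat_with_tools(&self, prompt: &str) -> Result<String, LlmError>;
}

/// Simplified model of the orchestrator (assumed field names).
struct Orchestrator {
    providers: HashMap<String, Box<dyn ToolProvider>>,
    /// Fallback order, default provider first.
    chain: Vec<String>,
    /// Name of the provider that served the last successful call.
    last_used: Option<String>,
}

impl Orchestrator {
    /// Try each provider in the chain; fail only when all are exhausted.
    fn chat_with_tools_with_fallback(&mut self, prompt: &str) -> Result<String, LlmError> {
        let mut last_err = LlmError::AllProvidersFailed;
        for name in &self.chain {
            // Skip names that have no registered provider.
            let Some(provider) = self.providers.get(name) else {
                continue;
            };
            match provider.chat_with_tools(prompt) {
                Ok(response) => {
                    // Record the provider that actually succeeded.
                    self.last_used = Some(name.clone());
                    return Ok(response);
                }
                // Remember the error and move on to the next provider.
                Err(err) => last_err = err,
            }
        }
        Err(last_err)
    }
}

/// Stub provider that always fails with the given status.
struct Failing(u16);
impl ToolProvider for Failing {
    fn chat_with_tools(&self, _prompt: &str) -> Result<String, LlmError> {
        Err(LlmError::Api(self.0))
    }
}

/// Stub provider that echoes the prompt, prefixed with its name.
struct Echo(&'static str);
impl ToolProvider for Echo {
    fn chat_with_tools(&self, prompt: &str) -> Result<String, LlmError> {
        Ok(format!("{}: {}", self.0, prompt))
    }
}

/// Builds the setup from the reproduction: claude (failing) then ollama.
fn demo_orchestrator() -> Orchestrator {
    let mut providers: HashMap<String, Box<dyn ToolProvider>> = HashMap::new();
    providers.insert("claude".to_string(), Box::new(Failing(400)));
    providers.insert("ollama".to_string(), Box::new(Echo("ollama")));
    Orchestrator {
        providers,
        chain: vec!["claude".to_string(), "ollama".to_string()],
        last_used: None,
    }
}
```

With the stub setup from demo_orchestrator(), the failing claude stub is tried first, the call is then served by the ollama stub, and last_used ends up as "ollama". If every provider fails, the error from the last attempted provider is returned and last_used is left unchanged.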
Notes
- Affects all configs where default_provider becomes unavailable (quota, network, model not loaded in Ollama)
- route_chat_with_tools() can serve as reference for the fix pattern
- Need to handle the case where fallback providers don't support tool_use natively (delegate to chat() or return a capability error)
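The two options in the last note (delegate to chat() or return a capability error) can be sketched as a small policy switch. Everything here is an assumption made for illustration: the Provider trait, supports_tool_use(), UnsupportedPolicy, ToolReply, ToolCallError, and the TextOnly stub do not come from the zeph-llm codebase, and the real methods are async.

```rust
/// Hypothetical error returned when a provider cannot do native tool calls.
#[derive(Debug, PartialEq)]
enum ToolCallError {
    ToolUseUnsupported { provider: String },
}

/// Distinguishes a native tool-call reply from a plain-text degradation.
#[derive(Debug, PartialEq)]
enum ToolReply {
    Native(String),
    PlainText(String),
}

/// What to do when a fallback provider lacks native tool_use support.
enum UnsupportedPolicy {
    DelegateToChat,
    Fail,
}

/// Hypothetical, synchronous provider interface with a capability flag.
trait Provider {
    fn name(&self) -> &str;
    fn supports_tool_use(&self) -> bool;
    fn chat(&self, prompt: &str) -> String;
    fn chat_with_tools(&self, prompt: &str) -> String;
}

/// Call the provider, honoring the policy when tool_use is unsupported.
fn call_with_capability_check(
    provider: &dyn Provider,
    prompt: &str,
    policy: &UnsupportedPolicy,
) -> Result<ToolReply, ToolCallError> {
    if provider.supports_tool_use() {
        return Ok(ToolReply::Native(provider.chat_with_tools(prompt)));
    }
    match policy {
        // Degrade to a plain chat call; the caller gets no tool calls back.
        UnsupportedPolicy::DelegateToChat => Ok(ToolReply::PlainText(provider.chat(prompt))),
        // Surface the missing capability so the caller can try the next provider.
        UnsupportedPolicy::Fail => Err(ToolCallError::ToolUseUnsupported {
            provider: provider.name().to_string(),
        }),
    }
}

/// Stub provider without native tool_use support.
struct TextOnly;
impl Provider for TextOnly {
    fn name(&self) -> &str {
        "text-only"
    }
    fn supports_tool_use(&self) -> bool {
        false
    }
    fn chat(&self, prompt: &str) -> String {
        format!("text-only: {}", prompt)
    }
    fn chat_with_tools(&self, prompt: &str) -> String {
        format!("text-only (tools): {}", prompt)
    }
}
```

Returning a distinct capability error lets a fallback loop treat an unsupported provider like any other failed attempt, while the delegate option keeps the agent responsive at the cost of losing tool calls for that turn.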