fix(orchestrator): chat_with_tools lacks fallback chain, fails immediately on default provider error #1941

@bug-ops

Summary

ModelOrchestrator::chat_with_tools (LlmProvider impl, crates/zeph-llm/src/orchestrator/mod.rs ~L466) delegates directly to default_provider with no fallback loop. When the default provider fails (e.g. HTTP 400, quota exhausted), the error propagates immediately to the agent — unlike chat() and stream() which go through chat_with_fallback() / stream_with_fallback() and try the next provider in the route chain.

Reproduction

Config: ~/.config/zeph/cloud.toml with [llm.orchestrator] and providers claude (default) + ollama (fallback).

  1. Claude API returns 400 (credit balance too low).
  2. chat() calls succeed via fallback — TUI shows provider switched to ollama. ✓
  3. Agent enters native tool_use path → calls chat_with_tools → goes directly to claude again → 400 → Response processing failed. ✗

Log evidence

WARN zeph_llm::orchestrator: provider claude failed: Claude API request failed (status 400), trying next
# ... fallback to ollama works for chat() ...
DEBUG zeph_core::agent::tool_execution::legacy: using native tool_use path provider="orchestrator"
DEBUG llm_call{model=claude-sonnet-4-6}: zeph_llm::orchestrator: orchestrator delegating chat_with_tools default_provider=claude
ERROR llm_call{model=claude-sonnet-4-6}: zeph_llm::claude: Claude API error 400 Bad Request: credit balance too low
ERROR zeph_core::agent: Response processing failed: Claude API request failed (status 400 Bad Request)

Root cause

crates/zeph-llm/src/orchestrator/mod.rs, LlmProvider::chat_with_tools impl:

// Current (no fallback):
async fn chat_with_tools(...) {
    let provider = self.providers.get(&self.default_provider)...;
    let response = provider.chat_with_tools(messages, tools).await?;  // <-- no fallback
    ...
}

There is a route_chat_with_tools() helper (L306-344) that falls back to default on named-provider failure, but the trait impl itself has no multi-provider fallback at all.

Expected behavior

chat_with_tools should iterate the same fallback chain as chat_with_fallback(): try each provider in the route in order, skip unavailable ones, and fail only once all are exhausted. The "last used provider" should be updated to the provider that actually succeeded.
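
A minimal sketch of the expected loop, for illustration only. It is synchronous and self-contained, so it does not match the real async LlmProvider trait. Every name here (LlmError, ToolProvider, StubProvider, Orchestrator, its fields, and chat_with_tools_with_fallback) is hypothetical and not taken from zeph-llm.

```rust
use std::collections::HashMap;

// Hypothetical error type; the real crate's error enum will differ.
#[derive(Debug, Clone, PartialEq)]
enum LlmError {
    RequestFailed(String),
    NoProviderAvailable,
}

// Hypothetical, synchronous stand-in for the tool-calling part of LlmProvider.
trait ToolProvider {
    fn chat_with_tools(&self, prompt: &str) -> Result<String, LlmError>;
}

// Test double that always returns a fixed result.
struct StubProvider {
    reply: Result<String, LlmError>,
}

impl ToolProvider for StubProvider {
    fn chat_with_tools(&self, _prompt: &str) -> Result<String, LlmError> {
        self.reply.clone()
    }
}

// Hypothetical orchestrator shape: named providers plus an ordered route chain.
struct Orchestrator {
    providers: HashMap<String, Box<dyn ToolProvider>>,
    chain: Vec<String>,
    last_used: Option<String>,
}

impl Orchestrator {
    fn chat_with_tools_with_fallback(&mut self, prompt: &str) -> Result<String, LlmError> {
        // Returned as-is if the chain is empty or names no known provider.
        let mut last_err = LlmError::NoProviderAvailable;
        for name in &self.chain {
            let provider = match self.providers.get(name) {
                Some(p) => p,
                None => continue, // not configured: skip to the next entry
            };
            match provider.chat_with_tools(prompt) {
                Ok(response) => {
                    // Record the provider that actually served the request.
                    self.last_used = Some(name.clone());
                    return Ok(response);
                }
                // Remember the failure and try the next provider in the chain.
                Err(err) => last_err = err,
            }
        }
        Err(last_err)
    }
}
```

With a chain of claude (stubbed to fail) followed by ollama (stubbed to succeed), the call returns the ollama response and last_used becomes "ollama". If every provider fails, the last error is returned.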

Notes

  • Affects all configs where default_provider becomes unavailable (quota, network, model not loaded in Ollama)
  • route_chat_with_tools() can serve as reference for the fix pattern
  • The fix needs to handle fallback providers that don't support tool_use natively: either delegate to chat() or return a capability error
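
The last note names two options for a fallback provider without native tool_use. One way to make that choice explicit is sketched below. This is an assumption-laden illustration, not the crate's API: Provider, supports_tool_use, NoToolsPolicy, CallError, ChatOnly, and call_with_policy are all hypothetical names, and the code is synchronous for brevity.

```rust
// Hypothetical error type with a dedicated capability variant.
#[derive(Debug, PartialEq)]
enum CallError {
    ToolUseUnsupported { provider: String },
    Request(String),
}

// Hypothetical provider interface exposing a capability flag.
trait Provider {
    fn name(&self) -> &str;
    fn supports_tool_use(&self) -> bool;
    fn chat(&self, prompt: &str) -> Result<String, CallError>;
    fn chat_with_tools(&self, prompt: &str) -> Result<String, CallError>;
}

// What to do when a provider in the chain lacks native tool_use.
#[derive(Debug, Clone, Copy)]
enum NoToolsPolicy {
    DegradeToChat,   // drop the tools and use plain chat()
    CapabilityError, // report the gap so the caller can move on
}

fn call_with_policy(
    provider: &dyn Provider,
    prompt: &str,
    policy: NoToolsPolicy,
) -> Result<String, CallError> {
    if provider.supports_tool_use() {
        return provider.chat_with_tools(prompt);
    }
    match policy {
        NoToolsPolicy::DegradeToChat => provider.chat(prompt),
        NoToolsPolicy::CapabilityError => Err(CallError::ToolUseUnsupported {
            provider: provider.name().to_string(),
        }),
    }
}

// Stub provider without tool support, used to exercise both policies.
struct ChatOnly;

impl Provider for ChatOnly {
    fn name(&self) -> &str {
        "chat-only"
    }
    fn supports_tool_use(&self) -> bool {
        false
    }
    fn chat(&self, prompt: &str) -> Result<String, CallError> {
        Ok(format!("plain reply to: {}", prompt))
    }
    fn chat_with_tools(&self, _prompt: &str) -> Result<String, CallError> {
        Err(CallError::Request("tool_use not implemented".to_string()))
    }
}
```

Returning a distinct capability error lets a fallback loop treat "cannot do tools" like any other per-provider failure and continue down the chain. Degrading to chat() keeps the conversation going, at the cost of silently losing tool calls.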

Labels: bug