fix(orchestrator): add circuit breaker and chat_with_tools fallback chain#1942

Merged
bug-ops merged 1 commit into main from feat/m27/1941-orchestrator-provider-failover
Mar 17, 2026

bug-ops commented Mar 17, 2026

Fixes #1941.

Problem

chat_with_tools delegated directly to default_provider with no fallback chain, unlike chat() and stream(), which iterate through route providers. When the default provider failed (e.g. HTTP 400 or an exhausted quota), the agent got an immediate error even if a working fallback existed. In addition, after chat() had successfully fallen back to another provider, the next chat_with_tools call retried the failed provider from scratch.

Changes

Circuit breaker

  • provider_failures: Arc<Mutex<HashMap<String, Instant>>> tracks the last failure time per provider
  • failure_ttl: Duration (default 300 s, configurable via failure_ttl_secs in config) — how long a failed provider is bypassed
  • is_provider_healthy / record_provider_failure / record_provider_success helpers
  • chat_with_fallback and stream_with_fallback prefer healthy providers; if all are unhealthy, they fall back to the full chain (graceful degradation)
  • with_failure_ttl(secs) builder method
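
The bookkeeping described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `CircuitBreaker` struct and the `candidates` helper are hypothetical stand-ins for state that lives on ModelOrchestrator, and it assumes that a provider counts as healthy again once `failure_ttl` has elapsed and that a success clears the recorded failure.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the breaker state held by the orchestrator.
#[derive(Clone)]
pub struct CircuitBreaker {
    /// Last failure time per provider name.
    provider_failures: Arc<Mutex<HashMap<String, Instant>>>,
    /// How long a failed provider is bypassed.
    failure_ttl: Duration,
}

impl CircuitBreaker {
    /// Starts with no recorded failures and the 300 s default TTL.
    pub fn new() -> Self {
        Self {
            provider_failures: Arc::new(Mutex::new(HashMap::new())),
            failure_ttl: Duration::from_secs(300),
        }
    }

    /// Builder-style override of the TTL, in seconds.
    pub fn with_failure_ttl(mut self, secs: u64) -> Self {
        self.failure_ttl = Duration::from_secs(secs);
        self
    }

    /// Assumption: healthy means "never failed" or "failed at least one TTL ago".
    pub fn is_provider_healthy(&self, name: &str) -> bool {
        let failures = self.provider_failures.lock().unwrap();
        match failures.get(name) {
            Some(failed_at) => failed_at.elapsed() >= self.failure_ttl,
            None => true,
        }
    }

    /// Remembers the time of the most recent failure for `name`.
    pub fn record_provider_failure(&self, name: &str) {
        let mut failures = self.provider_failures.lock().unwrap();
        failures.insert(name.to_owned(), Instant::now());
    }

    /// Assumption: a success removes the failure entry entirely.
    pub fn record_provider_success(&self, name: &str) {
        self.provider_failures.lock().unwrap().remove(name);
    }

    /// Healthy providers in chain order, or the full chain if none are healthy.
    pub fn candidates<'a>(&self, chain: &[&'a str]) -> Vec<&'a str> {
        let healthy: Vec<&'a str> = chain
            .iter()
            .copied()
            .filter(|name| self.is_provider_healthy(name))
            .collect();
        if healthy.is_empty() {
            chain.to_vec()
        } else {
            healthy
        }
    }
}
```

Storing only the last failure time keeps the state small; the trade-off is that a single failure bypasses a provider for the whole TTL, with no failure-count threshold.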

chat_with_tools fallback

  • New chat_with_tools_with_fallback — same chain-iteration logic as chat_with_fallback
  • LlmProvider::chat_with_tools now calls chat_with_tools_with_fallback instead of going directly to default_provider
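
As a rough sketch of the chain iteration, the example below walks the providers in order, skips recently failed ones, and returns the first successful reply. Everything except the name chat_with_tools_with_fallback is an assumption made for illustration: the real LlmProvider trait is not shown in this PR description, so `ToolProvider`, `Orchestrator`, and `StubProvider` are hypothetical, synchronous simplifications with string errors.

```rust
use std::cell::Cell;
use std::collections::HashMap;
use std::rc::Rc;
use std::time::{Duration, Instant};

/// Hypothetical, simplified provider interface (synchronous for brevity).
pub trait ToolProvider {
    fn name(&self) -> &str;
    fn chat_with_tools(&self, prompt: &str) -> Result<String, String>;
}

/// Hypothetical orchestrator holding the route's provider chain.
pub struct Orchestrator {
    chain: Vec<Box<dyn ToolProvider>>,
    failures: HashMap<String, Instant>,
    failure_ttl: Duration,
}

impl Orchestrator {
    pub fn new(chain: Vec<Box<dyn ToolProvider>>, failure_ttl: Duration) -> Self {
        Self { chain, failures: HashMap::new(), failure_ttl }
    }

    fn is_healthy(&self, name: &str) -> bool {
        self.failures
            .get(name)
            .map_or(true, |failed_at| failed_at.elapsed() >= self.failure_ttl)
    }

    /// Tries healthy providers in chain order; if none are healthy, tries the
    /// full chain anyway. Returns the first success or the last error seen.
    pub fn chat_with_tools_with_fallback(&mut self, prompt: &str) -> Result<String, String> {
        let healthy: Vec<usize> = (0..self.chain.len())
            .filter(|&i| self.is_healthy(self.chain[i].name()))
            .collect();
        let order: Vec<usize> = if healthy.is_empty() {
            (0..self.chain.len()).collect()
        } else {
            healthy
        };

        let mut last_err = String::from("no providers configured");
        for i in order {
            let name = self.chain[i].name().to_owned();
            match self.chain[i].chat_with_tools(prompt) {
                Ok(reply) => {
                    self.failures.remove(&name);
                    return Ok(reply);
                }
                Err(err) => {
                    self.failures.insert(name, Instant::now());
                    last_err = err;
                }
            }
        }
        Err(last_err)
    }
}

/// Stub provider that always returns the same result and counts its calls.
pub struct StubProvider {
    pub name: &'static str,
    pub reply: Result<&'static str, &'static str>,
    pub calls: Rc<Cell<usize>>,
}

impl ToolProvider for StubProvider {
    fn name(&self) -> &str {
        self.name
    }

    fn chat_with_tools(&self, _prompt: &str) -> Result<String, String> {
        self.calls.set(self.calls.get() + 1);
        self.reply.map(str::to_owned).map_err(str::to_owned)
    }
}
```

With a failing default and a working fallback in the chain, the first call tries both providers, and a second call within the TTL goes straight to the fallback, which is the behavior the problem statement says was missing.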

Config

[llm.orchestrator]
failure_ttl_secs = 300  # optional, defaults to 300 s
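
For illustration, the optional setting could be resolved into a Duration along these lines. This is a sketch under assumptions: the struct below is a trimmed, hypothetical version of OrchestratorConfig with only this field, and `resolve_failure_ttl` and `DEFAULT_FAILURE_TTL_SECS` are names invented here, not taken from bootstrap/provider.rs.

```rust
use std::time::Duration;

/// Default bypass window when failure_ttl_secs is not set.
const DEFAULT_FAILURE_TTL_SECS: u64 = 300;

/// Trimmed, hypothetical view of the orchestrator config section.
pub struct OrchestratorConfig {
    /// None means "use the 300 s default".
    pub failure_ttl_secs: Option<u64>,
}

/// Maps the optional config value to the TTL used by the circuit breaker.
pub fn resolve_failure_ttl(cfg: &OrchestratorConfig) -> Duration {
    Duration::from_secs(cfg.failure_ttl_secs.unwrap_or(DEFAULT_FAILURE_TTL_SECS))
}
```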

Tests

  • chat_with_tools_falls_back_when_default_fails
  • circuit_breaker_skips_recently_failed_provider
  • with_failure_ttl_sets_duration

6120 tests pass.

…hain

Fixes #1941.

- Add provider_failures (HashMap<String, Instant>) and failure_ttl
  (default 300s) fields to ModelOrchestrator for circuit breaker tracking
- After a provider fails, it is skipped for failure_ttl seconds on
  subsequent requests; if all providers are unhealthy, the full chain
  is attempted anyway (graceful degradation)
- chat_with_tools now uses chat_with_tools_with_fallback (same chain
  iteration logic as chat_with_fallback/stream_with_fallback) instead
  of delegating directly to default_provider with no fallback
- Add with_failure_ttl(secs) builder method for config-driven TTL
- Add failure_ttl_secs field to OrchestratorConfig (default: None = 300s)
- Wire failure_ttl_secs through bootstrap/provider.rs
- Add 3 new tests: tools fallback, circuit breaker behavior, TTL setter
github-actions bot added labels Mar 17, 2026: llm (zeph-llm crate: Ollama, Claude), rust (Rust code changes), core (zeph-core crate), bug (Something isn't working)
bug-ops enabled auto-merge (squash) March 17, 2026 13:10
github-actions bot added the size/L (Large PR, 201-500 lines) label Mar 17, 2026
bug-ops merged commit f48b6f5 into main Mar 17, 2026
20 checks passed
bug-ops deleted the feat/m27/1941-orchestrator-provider-failover branch March 17, 2026 13:20


Development

Successfully merging this pull request may close these issues.

fix(orchestrator): chat_with_tools lacks fallback chain, fails immediately on default provider error
