…hain

Fixes #1941.

- Add provider_failures (HashMap<String, Instant>) and failure_ttl (default 300s) fields to ModelOrchestrator for circuit breaker tracking
- After a provider fails, it is skipped for failure_ttl seconds on subsequent requests; if all providers are unhealthy, the full chain is attempted anyway (graceful degradation)
- chat_with_tools now uses chat_with_tools_with_fallback (same chain iteration logic as chat_with_fallback/stream_with_fallback) instead of delegating directly to default_provider with no fallback
- Add with_failure_ttl(secs) builder method for config-driven TTL
- Add failure_ttl_secs field to OrchestratorConfig (default: None = 300s)
- Wire failure_ttl_secs through bootstrap/provider.rs
- Add 3 new tests: tools fallback, circuit breaker behavior, TTL setter
Fixes #1941.
Problem
chat_with_tools delegated directly to default_provider with no fallback chain, unlike chat() and stream(), which iterate through route providers. When the default provider failed (e.g. HTTP 400, quota), the agent got an immediate error even if a working fallback existed. Additionally, after a successful fallback via chat(), the next chat_with_tools call retried the failed provider again from scratch.

Changes
Circuit breaker
- provider_failures: Arc<Mutex<HashMap<String, Instant>>> tracks the last failure time per provider
- failure_ttl: Duration (default 300 s, configurable via failure_ttl_secs in config): how long a failed provider is bypassed
- is_provider_healthy / record_provider_failure / record_provider_success helpers
- chat_with_fallback and stream_with_fallback prefer healthy providers; if all are unhealthy, they fall back to the full chain (graceful degradation)
- with_failure_ttl(secs) builder method

chat_with_tools fallback
- chat_with_tools_with_fallback: same chain-iteration logic as chat_with_fallback
- LlmProvider::chat_with_tools now calls chat_with_tools_with_fallback instead of going directly to default_provider

Config
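A minimal sketch of how the optional failure_ttl_secs setting might map onto with_failure_ttl, assuming None keeps the 300 s default as the commit summary states. The struct bodies, the new() constructor, and build_orchestrator are hypothetical stand-ins for illustration; they are not the actual wiring in bootstrap/provider.rs.

```rust
use std::time::Duration;

/// Default bypass window from the PR description (300 s).
const DEFAULT_FAILURE_TTL_SECS: u64 = 300;

/// Hypothetical stand-in for OrchestratorConfig; only the field named in this PR is shown.
#[derive(Default)]
struct OrchestratorConfig {
    /// None means "use the default TTL".
    failure_ttl_secs: Option<u64>,
}

/// Hypothetical stand-in for ModelOrchestrator; only the TTL field is shown.
struct ModelOrchestrator {
    failure_ttl: Duration,
}

impl ModelOrchestrator {
    /// Assumed constructor that starts from the default TTL.
    fn new() -> Self {
        Self {
            failure_ttl: Duration::from_secs(DEFAULT_FAILURE_TTL_SECS),
        }
    }

    /// Builder-style setter; the exact signature is an assumption.
    fn with_failure_ttl(mut self, secs: u64) -> Self {
        self.failure_ttl = Duration::from_secs(secs);
        self
    }
}

/// Assumed wiring: override the TTL only when the config provides a value.
fn build_orchestrator(config: &OrchestratorConfig) -> ModelOrchestrator {
    let orchestrator = ModelOrchestrator::new();
    match config.failure_ttl_secs {
        Some(secs) => orchestrator.with_failure_ttl(secs),
        None => orchestrator,
    }
}
```

Keeping the config field as an Option lets an absent setting fall through to the built-in default without the config layer having to know what that default is.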
Tests
- chat_with_tools_falls_back_when_default_fails
- circuit_breaker_skips_recently_failed_provider
- with_failure_ttl_sets_duration

6120 tests pass.
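The behavior these tests exercise can be sketched roughly as follows. This is a simplified, synchronous illustration under stated assumptions, not the PR's implementation: the CircuitBreaker type, call_with_fallback, and the closure standing in for a provider request are all hypothetical, and clearing the failure entry on success is an assumed reading of record_provider_success. Only the field and helper names come from the description above.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

/// Hypothetical, simplified stand-in for the orchestrator's breaker state.
struct CircuitBreaker {
    provider_failures: Arc<Mutex<HashMap<String, Instant>>>,
    failure_ttl: Duration,
}

impl CircuitBreaker {
    fn new(failure_ttl: Duration) -> Self {
        Self {
            provider_failures: Arc::new(Mutex::new(HashMap::new())),
            failure_ttl,
        }
    }

    /// Healthy: no recorded failure, or the last failure is at least failure_ttl old.
    fn is_provider_healthy(&self, name: &str) -> bool {
        let failures = self.provider_failures.lock().unwrap();
        match failures.get(name) {
            Some(failed_at) => failed_at.elapsed() >= self.failure_ttl,
            None => true,
        }
    }

    /// Remember when this provider last failed.
    fn record_provider_failure(&self, name: &str) {
        let mut failures = self.provider_failures.lock().unwrap();
        failures.insert(name.to_string(), Instant::now());
    }

    /// Assumption: a success clears the provider's failure entry.
    fn record_provider_success(&self, name: &str) {
        self.provider_failures.lock().unwrap().remove(name);
    }

    /// Try healthy providers in chain order; `call` stands in for the real request.
    fn call_with_fallback(
        &self,
        chain: &[&str],
        mut call: impl FnMut(&str) -> Result<String, String>,
    ) -> Result<String, String> {
        let healthy: Vec<&str> = chain
            .iter()
            .copied()
            .filter(|p| self.is_provider_healthy(p))
            .collect();
        // Graceful degradation: if every provider is marked unhealthy,
        // attempt the full chain anyway.
        let candidates: Vec<&str> = if healthy.is_empty() {
            chain.to_vec()
        } else {
            healthy
        };

        let mut last_err = String::from("no providers configured");
        for provider in candidates {
            match call(provider) {
                Ok(reply) => {
                    self.record_provider_success(provider);
                    return Ok(reply);
                }
                Err(err) => {
                    self.record_provider_failure(provider);
                    last_err = err;
                }
            }
        }
        Err(last_err)
    }
}
```

In this sketch, a failed provider is skipped on later calls until failure_ttl has elapsed, and a chain in which every provider is marked unhealthy is still tried in full rather than rejected outright.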