Fallback chain empty when session runs non-primary model (dead end after failover) #25912

@Taskle

Bug

When a session is running a non-primary model (e.g. Codex after Claude was rate-limited), resolveFallbackCandidates() returns only the configured primary as a fallback — the configured fallback chain is skipped.

This creates a dead-end scenario:

  1. Claude (primary) hits rate limit → session fails over to Codex (configured fallback)
  2. Codex encounters an error (timeout, 5xx, etc.)
  3. resolveFallbackCandidates() sees Codex ≠ configured primary, so modelFallbacks = []
  4. Only the configured primary (Claude) is added as a fallback candidate
  5. Claude is still in cooldown and at candidate index >0, so shouldProbePrimaryDuringCooldown returns false (it only probes index 0)
  6. All candidates exhausted → hard failure with no recovery
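The steps above can be reproduced with a small, self-contained model. This is a sketch, not the code in model-fallback.ts: `candidatesAsReported`, `firstUsable`, the `ModelRef` shape, the model identifiers, and the third fallback model are all hypothetical. It also assumes the configured primary is appended after the fallback chain.

```typescript
// Simplified model of the reported behaviour. All names here are illustrative.
type ModelRef = { provider: string; model: string };
type ModelConfig = { primary: ModelRef; fallbacks: ModelRef[] };

const key = (m: ModelRef): string => m.provider + "/" + m.model;

// Mirrors the bug: a session on a non-primary model gets an empty fallback
// chain, so only the configured primary is appended as a candidate.
function candidatesAsReported(current: ModelRef, cfg: ModelConfig): ModelRef[] {
  const isPrimary = key(current) === key(cfg.primary);
  const chain = isPrimary ? cfg.fallbacks : [];
  const ordered = [current, ...chain, cfg.primary];
  const keys = ordered.map(key);
  // Keep only the first occurrence of each provider+model pair.
  return ordered.filter((m, i) => keys.indexOf(key(m)) === i);
}

// Returns the first candidate that is neither erroring nor skipped for
// cooldown. Assumption: a cooling-down candidate is only probed at index 0.
function firstUsable(
  candidates: ModelRef[],
  erroring: string[],
  coolingDown: string[],
): ModelRef | null {
  for (let i = 0; i < candidates.length; i++) {
    const k = key(candidates[i]);
    if (i > 0 && coolingDown.indexOf(k) !== -1) continue;
    if (erroring.indexOf(k) !== -1) continue;
    return candidates[i];
  }
  return null;
}

const claude: ModelRef = { provider: "anthropic", model: "claude" };
const codex: ModelRef = { provider: "openai", model: "codex" };
const third: ModelRef = { provider: "example", model: "third-model" }; // hypothetical extra fallback

const cfg: ModelConfig = { primary: claude, fallbacks: [codex, third] };

// Session already failed over to Codex; Codex now errors, Claude is cooling down.
const candidates = candidatesAsReported(codex, cfg);
const result = firstUsable(candidates, [key(codex)], [key(claude)]);
```

In this model `candidates` contains only Codex and Claude, and `result` is `null`: the healthy third model is never tried even though it is configured.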

Root cause

In src/agents/model-fallback.ts, resolveFallbackCandidates():

if (!sameModelCandidate(normalizedPrimary, configuredPrimary)) {
  return []; // Override model failed → go straight to configured default
}

This was intended to handle explicit --model overrides, but it also fires when the session is running a failover model. The configured fallback chain (which could include other working models) is discarded.
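To illustrate why the guard cannot tell the two situations apart, here is a minimal sketch. `sameModel` is a hypothetical stand-in for `sameModelCandidate`, and the model identifiers are placeholders. The assumption is that the comparison only looks at provider and model, not at how the session came to be running that model.

```typescript
type ModelRef = { provider: string; model: string };

// Hypothetical stand-in for sameModelCandidate: compares provider and model only.
function sameModel(a: ModelRef, b: ModelRef): boolean {
  return a.provider === b.provider && a.model === b.model;
}

const configuredPrimary: ModelRef = { provider: "anthropic", model: "claude" };

// Case A: the user explicitly requested a different model (e.g. via --model).
const explicitOverride: ModelRef = { provider: "openai", model: "codex" };
// Case B: the session landed on the same model through automatic failover.
const afterFailover: ModelRef = { provider: "openai", model: "codex" };

// Both cases reduce to "current model differs from configured primary",
// so a guard of this form fires for both.
const guardFiresForOverride = !sameModel(explicitOverride, configuredPrimary);
const guardFiresForFailover = !sameModel(afterFailover, configuredPrimary);
```

Both flags evaluate to `true`, so a check based on model identity alone drops the fallback chain in the failover case as well.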

Impact

  • Post-failover sessions lose resilience: they can only fail back to the primary, which may still be in cooldown
  • If the primary provider is in extended rate limiting (hours/days), sessions on the fallback model are fragile
  • Creates a vicious cycle: failover → no fallback chain → hard failure → manual intervention required

Proposed fix

Remove the early return and always include the configured fallback chain:

// When running a non-default model (e.g. after failover), still include
// the configured fallback chain so all models remain reachable.
return resolveAgentModelFallbackValues(params.cfg?.agents?.defaults?.model);

The createModelCandidateCollector already deduplicates by provider+model, so there's no risk of duplicate candidates. The existing fallbacksOverride path (for explicit overrides via spawn) is preserved and takes priority.
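The effect of the proposed change, together with deduplication, can be sketched as follows. `collectCandidates` is a hypothetical stand-in for `createModelCandidateCollector`, and `candidatesAfterFix`, the model identifiers, and the third fallback are illustrative. The ordering (current model, then configured chain, then configured primary) is an assumption.

```typescript
type ModelRef = { provider: string; model: string };
type ModelConfig = { primary: ModelRef; fallbacks: ModelRef[] };

const modelKey = (m: ModelRef): string => m.provider + "/" + m.model;

// Hypothetical stand-in for createModelCandidateCollector: keeps the first
// occurrence of each provider+model pair and drops later repeats.
function collectCandidates(models: ModelRef[]): ModelRef[] {
  const seen: string[] = [];
  const out: ModelRef[] = [];
  for (let i = 0; i < models.length; i++) {
    const k = modelKey(models[i]);
    if (seen.indexOf(k) !== -1) continue;
    seen.push(k);
    out.push(models[i]);
  }
  return out;
}

// Proposed behaviour: always include the configured chain, whichever model
// the session is currently running.
function candidatesAfterFix(current: ModelRef, cfg: ModelConfig): ModelRef[] {
  return collectCandidates([current, ...cfg.fallbacks, cfg.primary]);
}

const claudeModel: ModelRef = { provider: "anthropic", model: "claude" };
const codexModel: ModelRef = { provider: "openai", model: "codex" };
const thirdModel: ModelRef = { provider: "example", model: "third-model" }; // hypothetical

const modelCfg: ModelConfig = { primary: claudeModel, fallbacks: [codexModel, thirdModel] };

// Session running Codex after failover: Codex appears once, and the third
// model is now reachable before falling back to the primary.
const afterFix = candidatesAfterFix(codexModel, modelCfg).map(modelKey);
```

Here `afterFix` lists Codex, the third model, and Claude, with no duplicate Codex entry even though Codex is both the current model and part of the configured chain.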

Test changes

Updated 5 tests in model-fallback.test.ts to reflect the new behavior:

  • Override models now fall back through the configured chain (not straight to primary)
  • All 30 tests pass

Environment

  • Discovered while diagnosing persistent flakiness after an Anthropic rate-limit event
  • Affects any setup with model.primary + model.fallbacks configured
  • Workaround: monkeypatch resolveFallbackCandidates in the dist file
