Skip to content

fix(agents): avoid duplicate same-provider cooldown probes in fallback runs#41711

Merged
altaywtf merged 5 commits intoopenclaw:mainfrom
cgdusek:codex/limit-cooldown-probe-per-provider
Mar 10, 2026
Merged

fix(agents): avoid duplicate same-provider cooldown probes in fallback runs#41711
altaywtf merged 5 commits intoopenclaw:mainfrom
cgdusek:codex/limit-cooldown-probe-per-provider

Conversation

@cgdusek
Copy link
Copy Markdown
Contributor

@cgdusek cgdusek commented Mar 10, 2026

Summary

  • limit cooldown-bypass probing to one attempt per provider per fallback run
  • preserve existing transient cooldown probe behavior (rate_limit / overloaded / billing) for the first candidate of that provider
  • add regression coverage for multi-model same-provider cooldown chains to ensure we continue to cross-provider fallbacks promptly

Why

In multi-model fallback chains, when all auth profiles for a provider are cooldowned, we currently allow cooldown-bypass probing on every same-provider candidate in one run. If that provider path has long internal retries/timeouts, fallback can spend most of the run repeatedly probing the same provider instead of reaching configured cross-provider candidates.

This patch keeps the first transient cooldown probe (so we still allow model-scoped recovery), but skips duplicate same-provider cooldown probes for the rest of that run.

Changes

  • src/agents/model-fallback.ts
    • track providers that already consumed a transient cooldown probe in the current fallback run
    • skip additional same-provider cooldown probes once one has already been attempted for that provider
    • log this as a normal skip_candidate fallback decision with inferred reason
  • src/agents/model-fallback.test.ts
    • add regression test covering: primary skipped (cooldown) -> first same-provider fallback probed+fails -> second same-provider fallback skipped -> cross-provider fallback succeeds
  • CHANGELOG.md
    • add Unreleased fix note

Validation

  • pnpm test src/agents/model-fallback.test.ts src/agents/model-fallback.probe.test.ts
  • pnpm exec vitest run -c vitest.e2e.config.ts src/agents/model-fallback.run-embedded.e2e.test.ts
  • pnpm check
  • pnpm build

Related

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 10, 2026

Greptile Summary

This PR fixes a fallback stall issue in multi-model same-provider chains by capping transient cooldown probing (for rate_limit / overloaded reasons) to one attempt per provider per fallback run. Previously, every same-provider candidate could independently trigger a cooldown-bypass probe, causing the fallback loop to stall on providers with long internal retry timeouts before ever reaching cross-provider candidates.

Key changes:

  • src/agents/model-fallback.ts: A cooldownProbeUsedProviders: Set<string> is initialized per-run and checked before allowing a transient cooldown probe. The first same-provider probe proceeds as before; subsequent same-provider candidates are skipped (logged as skip_candidate) and the loop moves on to the next candidate.
  • src/agents/model-fallback.test.ts: A new regression test verifies the four-step chain — primary skipped → first same-provider fallback probed and fails → second same-provider fallback skipped → cross-provider fallback succeeds — and the two-call mock implicitly asserts haiku is never invoked.
  • CHANGELOG.md: Unreleased fix entry added in the standard format.

Notable implementation detail: decision.markProbe is only ever true for primary candidates (markProbe: params.isPrimary && shouldProbe), so markProbeAttempt is never called for the non-primary candidates that the new guard skips — no probe-throttle slots are wasted on skipped fallbacks.

Minor observation: The billing case is included in the new guard's condition check (line 587–590), but resolveCooldownDecision already returns type: "skip" for all non-primary billing candidates before the guard is reached, so that branch is dead code in the current logic. It is harmless but may be slightly surprising to future readers.

Confidence Score: 5/5

  • This PR is safe to merge — the change is well-scoped, correctly ordered relative to existing guards, and backed by a targeted regression test.
  • The logic is minimal and contained to a single Set lookup inserted at the correct point in the fallback loop. The markProbeAttempt / probe-throttle interaction is unaffected because decision.markProbe is only true for primary candidates. The billing inclusion in the guard is dead code but harmless. The new test self-validates the skip by using a two-call mock that would fail if any additional same-provider candidate were unexpectedly executed.
  • No files require special attention.

Last reviewed commit: 5d40a50

@openclaw-barnacle openclaw-barnacle bot added agents Agent runtime and tooling size: S labels Mar 10, 2026
@aisle-research-bot
Copy link
Copy Markdown

aisle-research-bot bot commented Mar 10, 2026

🔒 Aisle Security Analysis

We found 2 potential security issue(s) in this PR:

# Severity Title
1 🔵 Low Cooldown probe limiter bypass via broad 'format' failure classification can stall cross-provider fallback (DoS)
2 🔵 Low Potential DoS via unguarded describeFailoverError() call on unknown thrown values

1. 🔵 Cooldown probe limiter bypass via broad 'format' failure classification can stall cross-provider fallback (DoS)

Property Value
Severity Low
CWE CWE-400
Location src/agents/model-fallback.ts:682-692

Description

In runWithModelFallback, the new per-provider cooldown probe limiter is bypassed when the first transient cooldown probe fails with a failure reason classified as format.

This can lead to multiple same-provider probe attempts (one per same-provider candidate) during a single fallback run, delaying cross-provider fallback and increasing latency/compute.

Why this is attacker-influenceable:

  • Probe-slot consumption is decided by describeFailoverError(err).reason.
  • describeFailoverError derives reason via resolveFailoverReasonFromError, which classifies any HTTP 400 as format (unless billing).
  • HTTP 400 responses can be triggered by untrusted/user-controlled content in many provider APIs (e.g., content/policy validation, malformed tool-call structures induced by prompt/tooling interactions), meaning a user can cause the probe to be labeled format.

Effect:

  • When the provider is already in transient cooldown (rate_limit/overloaded) and allowTransientCooldownProbe: true is enabled, a user-triggered 400/format failure will preserve the probe slot, allowing repeated probes on additional same-provider models before trying other providers. If each attempt has long upstream timeouts/retries, this becomes a per-request DoS vector.

Vulnerable code:

const probeFailureReason = describeFailoverError(err).reason;
const shouldPreserveTransientProbeSlot =
  probeFailureReason === "model_not_found" ||
  probeFailureReason === "format" || // preserves slot
  probeFailureReason === "auth" ||
  probeFailureReason === "auth_permanent" ||
  probeFailureReason === "session_expired";
if (!shouldPreserveTransientProbeSlot) {
  cooldownProbeUsedProviders.add(transientProbeProviderForAttempt);
}

Supporting classification behavior:

  • classifyFailoverReasonFromHttpStatus(400, ...) returns "format" for most 400s.

Recommendation

Tighten probe-slot preservation so that user-triggerable/broad classifications cannot bypass the per-provider probe limit.

Suggested changes:

  1. Consume the probe slot for format by default, or only preserve it for very narrow, explicitly detected cases.
  2. Alternatively, preserve based on explicit message patterns (e.g., only tool-call-id/schema errors), not on status === 400.
  3. Add a hard cap on total transient cooldown probes per fallback run (e.g., maxTransientCooldownProbesPerRun) to prevent worst-case stalls.

Example (simplest safe behavior — consume slot on format):

const probeFailureReason = describeFailoverError(err).reason;
const shouldPreserveTransientProbeSlot =
  probeFailureReason === "model_not_found" ||
  probeFailureReason === "auth" ||
  probeFailureReason === "auth_permanent" ||
  probeFailureReason === "session_expired";

cooldownProbeUsedProviders.add(transientProbeProviderForAttempt);// Optionally: if shouldPreserveTransientProbeSlot, remove/add back based on your intended semantics.

Or, if you need to preserve for some format errors:

const described = describeFailoverError(err);
const preserveForFormat =
  described.reason === "format" &&/tool_use\.id|tool call id was/i.test(described.message);

if (!(preserveForFormat || described.reason === "model_not_found")) {
  cooldownProbeUsedProviders.add(provider);
}

2. 🔵 Potential DoS via unguarded describeFailoverError() call on unknown thrown values

Property Value
Severity Low
CWE CWE-248
Location src/agents/model-fallback.ts:682-693

Description

The new transient cooldown probe tracking logic calls describeFailoverError(err).reason on an unknown caught value (err).

While describeFailoverError() looks safe for typical Error instances, it can throw for some valid JavaScript thrown values because it falls back to String(err) when it cannot extract a message:

  • describeFailoverError() computes const message = getErrorMessage(err) || String(err).
  • String(err) can throw (e.g., objects with no usable primitive conversion such as Object.create(null) or objects with toString/valueOf set to non-callable values).
  • This invocation is not wrapped in a try/catch, so a malformed/unexpected thrown value can crash the fallback loop during error handling (availability/DoS).

Vulnerable code (new call site):

const probeFailureReason = describeFailoverError(err).reason;

Relevant utility:

const message = getErrorMessage(err) || String(err);

Recommendation

Wrap the probe-reason extraction in a defensive try/catch (or use a safe formatter) so error handling can never throw while processing an already-failed attempt.

Example:

let probeFailureReason: FailoverReason | undefined;
try {
  probeFailureReason = describeFailoverError(err).reason;
} catch {
  probeFailureReason = undefined;
}

const shouldPreserveTransientProbeSlot =
  probeFailureReason === "model_not_found" ||
  probeFailureReason === "format" ||
  probeFailureReason === "auth" ||
  probeFailureReason === "auth_permanent" ||
  probeFailureReason === "session_expired";

Optionally harden describeFailoverError() itself by guarding the String(err) conversion:

let message = getErrorMessage(err);
if (!message) {
  try { message = String(err); } catch { message = Object.prototype.toString.call(err); }
}

Analyzed PR: #41711 at commit 8be8967

Last updated on: 2026-03-10T12:35:59Z

@cgdusek
Copy link
Copy Markdown
Contributor Author

cgdusek commented Mar 10, 2026

Thanks for the review pass and the note about the billing branch in the probe guard.

I followed up in d9609d2 to make the intent explicit:

  • keep allowTransientCooldownProbe for primary billing probe paths (needed by existing billing recovery tests), and
  • apply the one-probe-per-provider cap only for transient cooldown reasons (rate_limit / overloaded).

Local verification after the change:

  • pnpm test src/agents/model-fallback.test.ts src/agents/model-fallback.probe.test.ts
  • pnpm check

@cgdusek cgdusek force-pushed the codex/limit-cooldown-probe-per-provider branch from d9609d2 to 5fe63a6 Compare March 10, 2026 04:19
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5fe63a628c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@cgdusek
Copy link
Copy Markdown
Contributor Author

cgdusek commented Mar 10, 2026

Thanks for the security review and detailed write-up.

Triage outcome for this PR:

  • I validated the described routing behavior locally and in tests (pnpm test src/agents/model-fallback.test.ts).
  • This PR does not introduce new cross-provider fallback capability; it changes cooldown probe pacing within an already explicit fallback chain.

Security-model assessment (per SECURITY.md):

  • OpenClaw’s gateway model is trusted-operator, not adversarial multi-tenant on one shared gateway/agentDir.
  • Explicit fallback entries are treated as explicit operator intent (including cross-provider), and are intentionally not silently filtered by the model allowlist.
  • Risks that depend on untrusted operators sharing one gateway/config are listed out of scope in this repo’s security policy.

Given that, I’m treating this as policy/hardening guidance rather than a security bug in this patch. No additional code change is required in this PR for the documented trust boundaries.

If maintainers want, we can still track optional hardening separately (for example explicit cross-provider fallback gating) as a product policy feature.

@altaywtf altaywtf force-pushed the codex/limit-cooldown-probe-per-provider branch from 70e447e to 8be8967 Compare March 10, 2026 12:24
@altaywtf altaywtf merged commit 048e25c into openclaw:main Mar 10, 2026
28 checks passed
@altaywtf
Copy link
Copy Markdown
Member

Merged via squash.

Thanks @cgdusek!

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8be8967bcb

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +687 to +689
probeFailureReason === "auth" ||
probeFailureReason === "auth_permanent" ||
probeFailureReason === "session_expired";
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Consume transient probe slot on auth-like probe failures

The new shouldPreserveTransientProbeSlot logic preserves the provider slot for auth, auth_permanent, and session_expired, which allows duplicate same-provider cooldown probes to continue after credential/session failures. In runWithModelFallback, the auth profile store is loaded once before the candidate loop and not refreshed per attempt, so later candidates can still infer rate_limit/overloaded and pass back into probe mode unless the provider was added to cooldownProbeUsedProviders. In a fallback chain with multiple models on one provider, a first probe failing with 401/403/session expiry can therefore still probe subsequent same-provider models instead of moving to cross-provider fallback, reintroducing the stall this change is intended to prevent.

Useful? React with 👍 / 👎.

mrosmarin added a commit to mrosmarin/openclaw that referenced this pull request Mar 10, 2026
* main: (43 commits)
  docs: add openclaw#42173 to CHANGELOG — strip leaked model control tokens (openclaw#42216)
  Agents: align onPayload callback and OAuth imports
  docs: add Tengji (George) Zhang to maintainer table (openclaw#42190)
  fix: strip leaked model control tokens from user-facing text (openclaw#42173)
  Changelog: add unreleased March 9 entries
  chore: add .dev-state to .gitignore (openclaw#41848)
  fix(agents): avoid duplicate same-provider cooldown probes in fallback runs (openclaw#41711)
  fix(mattermost): preserve markdown formatting and native tables (openclaw#18655)
  feat(acp): add resumeSessionId to sessions_spawn for ACP session resume (openclaw#41847)
  ACPX: bump bundled acpx to 0.1.16 (openclaw#41975)
  mattermost: fix DM media upload for unprefixed user IDs (openclaw#29925)
  fix(msteams): use General channel conversation ID as team key for Bot Framework compatibility (openclaw#41838)
  fix(mattermost): read replyTo param in plugin handleAction send (openclaw#41176)
  fix(sandbox): pass real workspace to sessions_spawn when workspaceAccess is ro (openclaw#40757)
  fix(ui): replace Manual RPC text input with sorted method dropdown (openclaw#14967)
  CI: select Swift 6.2 toolchain for CodeQL (openclaw#41787)
  fix(agents): forward memory flush write path (openclaw#41761)
  fix(telegram): move network fallback to resolver-scoped dispatchers (openclaw#40740)
  fix(security): harden replaceMarkers() to catch space/underscore boundary marker variants (openclaw#35983)
  fix(web-search): recover OpenRouter Perplexity citations from message annotations (openclaw#40881)
  ...
Moshiii pushed a commit to Moshiii/openclaw that referenced this pull request Mar 11, 2026
…k runs (openclaw#41711)

Merged via squash.

Prepared head SHA: 8be8967
Co-authored-by: cgdusek <[email protected]>
Co-authored-by: altaywtf <[email protected]>
Reviewed-by: @altaywtf
Moshiii pushed a commit to Moshiii/openclaw that referenced this pull request Mar 11, 2026
…k runs (openclaw#41711)

Merged via squash.

Prepared head SHA: 8be8967
Co-authored-by: cgdusek <[email protected]>
Co-authored-by: altaywtf <[email protected]>
Reviewed-by: @altaywtf
frankekn pushed a commit to MoerAI/openclaw that referenced this pull request Mar 11, 2026
…k runs (openclaw#41711)

Merged via squash.

Prepared head SHA: 8be8967
Co-authored-by: cgdusek <[email protected]>
Co-authored-by: altaywtf <[email protected]>
Reviewed-by: @altaywtf
frankekn pushed a commit to Effet/openclaw that referenced this pull request Mar 11, 2026
…k runs (openclaw#41711)

Merged via squash.

Prepared head SHA: 8be8967
Co-authored-by: cgdusek <[email protected]>
Co-authored-by: altaywtf <[email protected]>
Reviewed-by: @altaywtf
frankekn pushed a commit to ImLukeF/openclaw that referenced this pull request Mar 11, 2026
…k runs (openclaw#41711)

Merged via squash.

Prepared head SHA: 8be8967
Co-authored-by: cgdusek <[email protected]>
Co-authored-by: altaywtf <[email protected]>
Reviewed-by: @altaywtf
dominicnunez pushed a commit to dominicnunez/openclaw that referenced this pull request Mar 11, 2026
…k runs (openclaw#41711)

Merged via squash.

Prepared head SHA: 8be8967
Co-authored-by: cgdusek <[email protected]>
Co-authored-by: altaywtf <[email protected]>
Reviewed-by: @altaywtf
@cgdusek cgdusek deleted the codex/limit-cooldown-probe-per-provider branch March 11, 2026 18:56
dhoman pushed a commit to dhoman/chrono-claw that referenced this pull request Mar 11, 2026
…k runs (openclaw#41711)

Merged via squash.

Prepared head SHA: 8be8967
Co-authored-by: cgdusek <[email protected]>
Co-authored-by: altaywtf <[email protected]>
Reviewed-by: @altaywtf
ahelpercn pushed a commit to ahelpercn/openclaw that referenced this pull request Mar 12, 2026
…k runs (openclaw#41711)

Merged via squash.

Prepared head SHA: 8be8967
Co-authored-by: cgdusek <[email protected]>
Co-authored-by: altaywtf <[email protected]>
Reviewed-by: @altaywtf
Ruijie-Ysp pushed a commit to Ruijie-Ysp/clawdbot that referenced this pull request Mar 12, 2026
…k runs (openclaw#41711)

Merged via squash.

Prepared head SHA: 8be8967
Co-authored-by: cgdusek <[email protected]>
Co-authored-by: altaywtf <[email protected]>
Reviewed-by: @altaywtf
Interstellar-code pushed a commit to Interstellar-code/operator1 that referenced this pull request Mar 16, 2026
…k runs (openclaw#41711)

Merged via squash.

Prepared head SHA: 8be8967
Co-authored-by: cgdusek <[email protected]>
Co-authored-by: altaywtf <[email protected]>
Reviewed-by: @altaywtf

(cherry picked from commit 048e25c)
senw-developers pushed a commit to senw-developers/va-openclaw that referenced this pull request Mar 17, 2026
…k runs (openclaw#41711)

Merged via squash.

Prepared head SHA: 8be8967
Co-authored-by: cgdusek <[email protected]>
Co-authored-by: altaywtf <[email protected]>
Reviewed-by: @altaywtf
V-Gutierrez pushed a commit to V-Gutierrez/openclaw-vendor that referenced this pull request Mar 17, 2026
…k runs (openclaw#41711)

Merged via squash.

Prepared head SHA: 8be8967
Co-authored-by: cgdusek <[email protected]>
Co-authored-by: altaywtf <[email protected]>
Reviewed-by: @altaywtf
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants