fix(agents): avoid duplicate same-provider cooldown probes in fallback runs by cgdusek · Pull Request #41711 · openclaw/openclaw

cgdusek · 2026-03-10T03:56:48Z

Summary

limit cooldown-bypass probing to one attempt per provider per fallback run
preserve existing transient cooldown probe behavior (rate_limit / overloaded / billing) for the first candidate of that provider
add regression coverage for multi-model same-provider cooldown chains to ensure we continue to cross-provider fallbacks promptly

Why

In multi-model fallback chains, when all auth profiles for a provider are cooldowned, we currently allow cooldown-bypass probing on every same-provider candidate in one run. If that provider path has long internal retries/timeouts, fallback can spend most of the run repeatedly probing the same provider instead of reaching configured cross-provider candidates.

This patch keeps the first transient cooldown probe (so we still allow model-scoped recovery), but skips duplicate same-provider cooldown probes for the rest of that run.

Changes

src/agents/model-fallback.ts
- track providers that already consumed a transient cooldown probe in the current fallback run
- skip additional same-provider cooldown probes once one has already been attempted for that provider
- log this as a normal skip_candidate fallback decision with inferred reason
src/agents/model-fallback.test.ts
- add regression test covering: primary skipped (cooldown) -> first same-provider fallback probed+fails -> second same-provider fallback skipped -> cross-provider fallback succeeds
CHANGELOG.md
- add Unreleased fix note

Validation

pnpm test src/agents/model-fallback.test.ts src/agents/model-fallback.probe.test.ts
pnpm exec vitest run -c vitest.e2e.config.ts src/agents/model-fallback.run-embedded.e2e.test.ts
pnpm check
pnpm build

This PR fixes a fallback stall issue in multi-model same-provider chains by capping transient cooldown probing (for rate_limit / overloaded reasons) to one attempt per provider per fallback run. Previously, every same-provider candidate could independently trigger a cooldown-bypass probe, causing the fallback loop to stall on providers with long internal retry timeouts before ever reaching cross-provider candidates.

Key changes:

src/agents/model-fallback.ts: A cooldownProbeUsedProviders: Set<string> is initialized per-run and checked before allowing a transient cooldown probe. The first same-provider probe proceeds as before; subsequent same-provider candidates are skipped (logged as skip_candidate) and the loop moves on to the next candidate.
src/agents/model-fallback.test.ts: A new regression test verifies the four-step chain — primary skipped → first same-provider fallback probed and fails → second same-provider fallback skipped → cross-provider fallback succeeds — and the two-call mock implicitly asserts haiku is never invoked.
CHANGELOG.md: Unreleased fix entry added in the standard format.

Notable implementation detail: decision.markProbe is only ever true for primary candidates (markProbe: params.isPrimary && shouldProbe), so markProbeAttempt is never called for the non-primary candidates that the new guard skips — no probe-throttle slots are wasted on skipped fallbacks.

Minor observation: The billing case is included in the new guard's condition check (line 587–590), but resolveCooldownDecision already returns type: "skip" for all non-primary billing candidates before the guard is reached, so that branch is dead code in the current logic. It is harmless but may be slightly surprising to future readers.

Confidence Score: 5/5

This PR is safe to merge — the change is well-scoped, correctly ordered relative to existing guards, and backed by a targeted regression test.
The logic is minimal and contained to a single Set lookup inserted at the correct point in the fallback loop. The markProbeAttempt / probe-throttle interaction is unaffected because decision.markProbe is only true for primary candidates. The billing inclusion in the guard is dead code but harmless. The new test self-validates the skip by using a two-call mock that would fail if any additional same-provider candidate were unexpectedly executed.
No files require special attention.

_{Last reviewed commit: 5d40a50}

aisle-research-bot · 2026-03-10T04:16:52Z

🔒 Aisle Security Analysis

We found 2 potential security issue(s) in this PR:

#	Severity	Title
1	🔵 Low	Cooldown probe limiter bypass via broad 'format' failure classification can stall cross-provider fallback (DoS)
2	🔵 Low	Potential DoS via unguarded describeFailoverError() call on unknown thrown values

1. 🔵 Cooldown probe limiter bypass via broad 'format' failure classification can stall cross-provider fallback (DoS)

Property	Value
Severity	Low
CWE	CWE-400
Location	`src/agents/model-fallback.ts:682-692`

Description

In runWithModelFallback, the new per-provider cooldown probe limiter is bypassed when the first transient cooldown probe fails with a failure reason classified as format.

This can lead to multiple same-provider probe attempts (one per same-provider candidate) during a single fallback run, delaying cross-provider fallback and increasing latency/compute.

Why this is attacker-influenceable:

Probe-slot consumption is decided by describeFailoverError(err).reason.
describeFailoverError derives reason via resolveFailoverReasonFromError, which classifies any HTTP 400 as format (unless billing).
HTTP 400 responses can be triggered by untrusted/user-controlled content in many provider APIs (e.g., content/policy validation, malformed tool-call structures induced by prompt/tooling interactions), meaning a user can cause the probe to be labeled format.

Effect:

When the provider is already in transient cooldown (rate_limit/overloaded) and allowTransientCooldownProbe: true is enabled, a user-triggered 400/format failure will preserve the probe slot, allowing repeated probes on additional same-provider models before trying other providers. If each attempt has long upstream timeouts/retries, this becomes a per-request DoS vector.

Vulnerable code:

const probeFailureReason = describeFailoverError(err).reason;
const shouldPreserveTransientProbeSlot =
  probeFailureReason === "model_not_found" ||
  probeFailureReason === "format" || // preserves slot
  probeFailureReason === "auth" ||
  probeFailureReason === "auth_permanent" ||
  probeFailureReason === "session_expired";
if (!shouldPreserveTransientProbeSlot) {
  cooldownProbeUsedProviders.add(transientProbeProviderForAttempt);
}

Supporting classification behavior:

classifyFailoverReasonFromHttpStatus(400, ...) returns "format" for most 400s.

Recommendation

Tighten probe-slot preservation so that user-triggerable/broad classifications cannot bypass the per-provider probe limit.

Suggested changes:

Consume the probe slot for format by default, or only preserve it for very narrow, explicitly detected cases.
Alternatively, preserve based on explicit message patterns (e.g., only tool-call-id/schema errors), not on status === 400.
Add a hard cap on total transient cooldown probes per fallback run (e.g., maxTransientCooldownProbesPerRun) to prevent worst-case stalls.

Example (simplest safe behavior — consume slot on format):

const probeFailureReason = describeFailoverError(err).reason;
const shouldPreserveTransientProbeSlot =
  probeFailureReason === "model_not_found" ||
  probeFailureReason === "auth" ||
  probeFailureReason === "auth_permanent" ||
  probeFailureReason === "session_expired";

cooldownProbeUsedProviders.add(transientProbeProviderForAttempt);
// Optionally: if shouldPreserveTransientProbeSlot, remove/add back based on your intended semantics.

Or, if you need to preserve for some format errors:

const described = describeFailoverError(err);
const preserveForFormat =
  described.reason === "format" &&
  /tool_use\.id|tool call id was/i.test(described.message);

if (!(preserveForFormat || described.reason === "model_not_found")) {
  cooldownProbeUsedProviders.add(provider);
}

2. 🔵 Potential DoS via unguarded describeFailoverError() call on unknown thrown values

Property	Value
Severity	Low
CWE	CWE-248
Location	`src/agents/model-fallback.ts:682-693`

Description

The new transient cooldown probe tracking logic calls describeFailoverError(err).reason on an unknown caught value (err).

While describeFailoverError() looks safe for typical Error instances, it can throw for some valid JavaScript thrown values because it falls back to String(err) when it cannot extract a message:

describeFailoverError() computes const message = getErrorMessage(err) || String(err).
String(err) can throw (e.g., objects with no usable primitive conversion such as Object.create(null) or objects with toString/valueOf set to non-callable values).
This invocation is not wrapped in a try/catch, so a malformed/unexpected thrown value can crash the fallback loop during error handling (availability/DoS).

Vulnerable code (new call site):

const probeFailureReason = describeFailoverError(err).reason;

Relevant utility:

const message = getErrorMessage(err) || String(err);

Recommendation

Wrap the probe-reason extraction in a defensive try/catch (or use a safe formatter) so error handling can never throw while processing an already-failed attempt.

Example:

let probeFailureReason: FailoverReason | undefined;
try {
  probeFailureReason = describeFailoverError(err).reason;
} catch {
  probeFailureReason = undefined;
}

const shouldPreserveTransientProbeSlot =
  probeFailureReason === "model_not_found" ||
  probeFailureReason === "format" ||
  probeFailureReason === "auth" ||
  probeFailureReason === "auth_permanent" ||
  probeFailureReason === "session_expired";

Optionally harden describeFailoverError() itself by guarding the String(err) conversion:

let message = getErrorMessage(err);
if (!message) {
  try { message = String(err); } catch { message = Object.prototype.toString.call(err); }
}

Analyzed PR: #41711 at commit 8be8967

_{Last updated on: 2026-03-10T12:35:59Z}

cgdusek · 2026-03-10T04:17:39Z

Thanks for the review pass and the note about the billing branch in the probe guard.

I followed up in d9609d2 to make the intent explicit:

keep allowTransientCooldownProbe for primary billing probe paths (needed by existing billing recovery tests), and
apply the one-probe-per-provider cap only for transient cooldown reasons (rate_limit / overloaded).

Local verification after the change:

pnpm test src/agents/model-fallback.test.ts src/agents/model-fallback.probe.test.ts
pnpm check

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5fe63a628c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

src/agents/model-fallback.ts

cgdusek · 2026-03-10T04:54:46Z

Thanks for the security review and detailed write-up.

Triage outcome for this PR:

I validated the described routing behavior locally and in tests (pnpm test src/agents/model-fallback.test.ts).
This PR does not introduce new cross-provider fallback capability; it changes cooldown probe pacing within an already explicit fallback chain.

Security-model assessment (per SECURITY.md):

OpenClaw’s gateway model is trusted-operator, not adversarial multi-tenant on one shared gateway/agentDir.
Explicit fallback entries are treated as explicit operator intent (including cross-provider), and are intentionally not silently filtered by the model allowlist.
Risks that depend on untrusted operators sharing one gateway/config are listed out of scope in this repo’s security policy.

Given that, I’m treating this as policy/hardening guidance rather than a security bug in this patch. No additional code change is required in this PR for the documented trust boundaries.

If maintainers want, we can still track optional hardening separately (for example explicit cross-provider fallback gating) as a product policy feature.

altaywtf · 2026-03-10T12:26:53Z

Merged via squash.

Prepared head SHA: 8be8967bcb4be81f6abc5ff078644ec4efcfe7a0
Merge commit: 048e25c2b21d56962c482ea63f7fe78795194609

Thanks @cgdusek!

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8be8967bcb

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-10T12:32:33Z

src/agents/model-fallback.ts

+          probeFailureReason === "auth" ||
+          probeFailureReason === "auth_permanent" ||
+          probeFailureReason === "session_expired";


Consume transient probe slot on auth-like probe failures

The new shouldPreserveTransientProbeSlot logic preserves the provider slot for auth, auth_permanent, and session_expired, which allows duplicate same-provider cooldown probes to continue after credential/session failures. In runWithModelFallback, the auth profile store is loaded once before the candidate loop and not refreshed per attempt, so later candidates can still infer rate_limit/overloaded and pass back into probe mode unless the provider was added to cooldownProbeUsedProviders. In a fallback chain with multiple models on one provider, a first probe failing with 401/403/session expiry can therefore still probe subsequent same-provider models instead of moving to cross-provider fallback, reintroducing the stall this change is intended to prevent.

Useful? React with 👍 / 👎.

* main: (43 commits) docs: add openclaw#42173 to CHANGELOG — strip leaked model control tokens (openclaw#42216) Agents: align onPayload callback and OAuth imports docs: add Tengji (George) Zhang to maintainer table (openclaw#42190) fix: strip leaked model control tokens from user-facing text (openclaw#42173) Changelog: add unreleased March 9 entries chore: add .dev-state to .gitignore (openclaw#41848) fix(agents): avoid duplicate same-provider cooldown probes in fallback runs (openclaw#41711) fix(mattermost): preserve markdown formatting and native tables (openclaw#18655) feat(acp): add resumeSessionId to sessions_spawn for ACP session resume (openclaw#41847) ACPX: bump bundled acpx to 0.1.16 (openclaw#41975) mattermost: fix DM media upload for unprefixed user IDs (openclaw#29925) fix(msteams): use General channel conversation ID as team key for Bot Framework compatibility (openclaw#41838) fix(mattermost): read replyTo param in plugin handleAction send (openclaw#41176) fix(sandbox): pass real workspace to sessions_spawn when workspaceAccess is ro (openclaw#40757) fix(ui): replace Manual RPC text input with sorted method dropdown (openclaw#14967) CI: select Swift 6.2 toolchain for CodeQL (openclaw#41787) fix(agents): forward memory flush write path (openclaw#41761) fix(telegram): move network fallback to resolver-scoped dispatchers (openclaw#40740) fix(security): harden replaceMarkers() to catch space/underscore boundary marker variants (openclaw#35983) fix(web-search): recover OpenRouter Perplexity citations from message annotations (openclaw#40881) ...

@altaywtf

…k runs (openclaw#41711) Merged via squash. Prepared head SHA: 8be8967 Co-authored-by: cgdusek <[email protected]> Co-authored-by: altaywtf <[email protected]> Reviewed-by: @altaywtf

@altaywtf

…k runs (openclaw#41711) Merged via squash. Prepared head SHA: 8be8967 Co-authored-by: cgdusek <[email protected]> Co-authored-by: altaywtf <[email protected]> Reviewed-by: @altaywtf

@altaywtf

…k runs (openclaw#41711) Merged via squash. Prepared head SHA: 8be8967 Co-authored-by: cgdusek <[email protected]> Co-authored-by: altaywtf <[email protected]> Reviewed-by: @altaywtf

@altaywtf

…k runs (openclaw#41711) Merged via squash. Prepared head SHA: 8be8967 Co-authored-by: cgdusek <[email protected]> Co-authored-by: altaywtf <[email protected]> Reviewed-by: @altaywtf

@altaywtf

…k runs (openclaw#41711) Merged via squash. Prepared head SHA: 8be8967 Co-authored-by: cgdusek <[email protected]> Co-authored-by: altaywtf <[email protected]> Reviewed-by: @altaywtf

@altaywtf

…k runs (openclaw#41711) Merged via squash. Prepared head SHA: 8be8967 Co-authored-by: cgdusek <[email protected]> Co-authored-by: altaywtf <[email protected]> Reviewed-by: @altaywtf

@altaywtf

…k runs (openclaw#41711) Merged via squash. Prepared head SHA: 8be8967 Co-authored-by: cgdusek <[email protected]> Co-authored-by: altaywtf <[email protected]> Reviewed-by: @altaywtf

@altaywtf

…k runs (openclaw#41711) Merged via squash. Prepared head SHA: 8be8967 Co-authored-by: cgdusek <[email protected]> Co-authored-by: altaywtf <[email protected]> Reviewed-by: @altaywtf

@altaywtf

…k runs (openclaw#41711) Merged via squash. Prepared head SHA: 8be8967 Co-authored-by: cgdusek <[email protected]> Co-authored-by: altaywtf <[email protected]> Reviewed-by: @altaywtf

@altaywtf

…k runs (openclaw#41711) Merged via squash. Prepared head SHA: 8be8967 Co-authored-by: cgdusek <[email protected]> Co-authored-by: altaywtf <[email protected]> Reviewed-by: @altaywtf (cherry picked from commit 048e25c)

@altaywtf

…k runs (openclaw#41711) Merged via squash. Prepared head SHA: 8be8967 Co-authored-by: cgdusek <[email protected]> Co-authored-by: altaywtf <[email protected]> Reviewed-by: @altaywtf

@altaywtf

…k runs (openclaw#41711) Merged via squash. Prepared head SHA: 8be8967 Co-authored-by: cgdusek <[email protected]> Co-authored-by: altaywtf <[email protected]> Reviewed-by: @altaywtf

openclaw-barnacle bot added agents Agent runtime and tooling size: S labels Mar 10, 2026

cgdusek force-pushed the codex/limit-cooldown-probe-per-provider branch from d9609d2 to 5fe63a6 Compare March 10, 2026 04:19

chatgpt-codex-connector bot reviewed Mar 10, 2026

View reviewed changes

src/agents/model-fallback.ts Outdated Show resolved Hide resolved

cgdusek and others added 5 commits March 10, 2026 15:18

agents: limit duplicate cooldown probes per provider

95eb951

agents: scope duplicate-probe cap to transient cooldown reasons

47645d9

agents: preserve transient probe slot on model-not-found probes

11e2ece

Changelog: reference fallback probe PR

1f27457

Changelog: append fallback probe entry

8be8967

altaywtf force-pushed the codex/limit-cooldown-probe-per-provider branch from 70e447e to 8be8967 Compare March 10, 2026 12:24

altaywtf merged commit 048e25c into openclaw:main Mar 10, 2026
28 checks passed

chatgpt-codex-connector bot reviewed Mar 10, 2026

View reviewed changes

cgdusek deleted the codex/limit-cooldown-probe-per-provider branch March 11, 2026 18:56

github-actions bot mentioned this pull request Mar 12, 2026

上游更新: v2026.3.11 — 14 P0 + 32 P1 待合并 jiulingyun/openclaw-cn#499

Open

wboudy mentioned this pull request Mar 12, 2026

fix(models): probe-safe cooldown handling and compatible fallback attribution #15881

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(agents): avoid duplicate same-provider cooldown probes in fallback runs#41711

fix(agents): avoid duplicate same-provider cooldown probes in fallback runs#41711
altaywtf merged 5 commits intoopenclaw:mainfrom
cgdusek:codex/limit-cooldown-probe-per-provider

cgdusek commented Mar 10, 2026

Uh oh!

greptile-apps bot commented Mar 10, 2026

Uh oh!

aisle-research-bot bot commented Mar 10, 2026 •

edited

Loading

Uh oh!

cgdusek commented Mar 10, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

Uh oh!

cgdusek commented Mar 10, 2026

Uh oh!

Uh oh!

altaywtf commented Mar 10, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

chatgpt-codex-connector bot Mar 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

cgdusek commented Mar 10, 2026

Summary

Why

Changes

Validation

Related

Uh oh!

greptile-apps bot commented Mar 10, 2026

Greptile Summary

Confidence Score: 5/5

Uh oh!

aisle-research-bot bot commented Mar 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔒 Aisle Security Analysis

1. 🔵 Cooldown probe limiter bypass via broad 'format' failure classification can stall cross-provider fallback (DoS)

Description

Recommendation

2. 🔵 Potential DoS via unguarded describeFailoverError() call on unknown thrown values

Description

Recommendation

Uh oh!

cgdusek commented Mar 10, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

cgdusek commented Mar 10, 2026

Uh oh!

Uh oh!

altaywtf commented Mar 10, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Mar 10, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

aisle-research-bot bot commented Mar 10, 2026 •

edited

Loading