fix(agents): avoid duplicate same-provider cooldown probes in fallback runs#41711
Conversation
Greptile SummaryThis PR fixes a fallback stall issue in multi-model same-provider chains by capping transient cooldown probing (for Key changes:
Notable implementation detail: Minor observation: The Confidence Score: 5/5
Last reviewed commit: 5d40a50 |
🔒 Aisle Security AnalysisWe found 2 potential security issue(s) in this PR:
1. 🔵 Cooldown probe limiter bypass via broad 'format' failure classification can stall cross-provider fallback (DoS)
DescriptionIn This can lead to multiple same-provider probe attempts (one per same-provider candidate) during a single fallback run, delaying cross-provider fallback and increasing latency/compute. Why this is attacker-influenceable:
Effect:
Vulnerable code: const probeFailureReason = describeFailoverError(err).reason;
const shouldPreserveTransientProbeSlot =
probeFailureReason === "model_not_found" ||
probeFailureReason === "format" || // preserves slot
probeFailureReason === "auth" ||
probeFailureReason === "auth_permanent" ||
probeFailureReason === "session_expired";
if (!shouldPreserveTransientProbeSlot) {
cooldownProbeUsedProviders.add(transientProbeProviderForAttempt);
}Supporting classification behavior:
RecommendationTighten probe-slot preservation so that user-triggerable/broad classifications cannot bypass the per-provider probe limit. Suggested changes:
Example (simplest safe behavior — consume slot on const probeFailureReason = describeFailoverError(err).reason;
const shouldPreserveTransientProbeSlot =
probeFailureReason === "model_not_found" ||
probeFailureReason === "auth" ||
probeFailureReason === "auth_permanent" ||
probeFailureReason === "session_expired";
cooldownProbeUsedProviders.add(transientProbeProviderForAttempt);
// Optionally: if shouldPreserveTransientProbeSlot, remove/add back based on your intended semantics.Or, if you need to preserve for some format errors: const described = describeFailoverError(err);
const preserveForFormat =
described.reason === "format" &&
/tool_use\.id|tool call id was/i.test(described.message);
if (!(preserveForFormat || described.reason === "model_not_found")) {
cooldownProbeUsedProviders.add(provider);
}2. 🔵 Potential DoS via unguarded describeFailoverError() call on unknown thrown values
DescriptionThe new transient cooldown probe tracking logic calls While
Vulnerable code (new call site): const probeFailureReason = describeFailoverError(err).reason;Relevant utility: const message = getErrorMessage(err) || String(err);RecommendationWrap the probe-reason extraction in a defensive Example: let probeFailureReason: FailoverReason | undefined;
try {
probeFailureReason = describeFailoverError(err).reason;
} catch {
probeFailureReason = undefined;
}
const shouldPreserveTransientProbeSlot =
probeFailureReason === "model_not_found" ||
probeFailureReason === "format" ||
probeFailureReason === "auth" ||
probeFailureReason === "auth_permanent" ||
probeFailureReason === "session_expired";Optionally harden let message = getErrorMessage(err);
if (!message) {
try { message = String(err); } catch { message = Object.prototype.toString.call(err); }
}Analyzed PR: #41711 at commit Last updated on: 2026-03-10T12:35:59Z |
|
Thanks for the review pass and the note about the I followed up in d9609d2 to make the intent explicit:
Local verification after the change:
|
d9609d2 to
5fe63a6
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5fe63a628c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
Thanks for the security review and detailed write-up. Triage outcome for this PR:
Security-model assessment (per
Given that, I’m treating this as policy/hardening guidance rather than a security bug in this patch. No additional code change is required in this PR for the documented trust boundaries. If maintainers want, we can still track optional hardening separately (for example explicit cross-provider fallback gating) as a product policy feature. |
70e447e to
8be8967
Compare
|
Merged via squash.
Thanks @cgdusek! |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8be8967bcb
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| probeFailureReason === "auth" || | ||
| probeFailureReason === "auth_permanent" || | ||
| probeFailureReason === "session_expired"; |
There was a problem hiding this comment.
Consume transient probe slot on auth-like probe failures
The new shouldPreserveTransientProbeSlot logic preserves the provider slot for auth, auth_permanent, and session_expired, which allows duplicate same-provider cooldown probes to continue after credential/session failures. In runWithModelFallback, the auth profile store is loaded once before the candidate loop and not refreshed per attempt, so later candidates can still infer rate_limit/overloaded and pass back into probe mode unless the provider was added to cooldownProbeUsedProviders. In a fallback chain with multiple models on one provider, a first probe failing with 401/403/session expiry can therefore still probe subsequent same-provider models instead of moving to cross-provider fallback, reintroducing the stall this change is intended to prevent.
Useful? React with 👍 / 👎.
* main: (43 commits) docs: add openclaw#42173 to CHANGELOG — strip leaked model control tokens (openclaw#42216) Agents: align onPayload callback and OAuth imports docs: add Tengji (George) Zhang to maintainer table (openclaw#42190) fix: strip leaked model control tokens from user-facing text (openclaw#42173) Changelog: add unreleased March 9 entries chore: add .dev-state to .gitignore (openclaw#41848) fix(agents): avoid duplicate same-provider cooldown probes in fallback runs (openclaw#41711) fix(mattermost): preserve markdown formatting and native tables (openclaw#18655) feat(acp): add resumeSessionId to sessions_spawn for ACP session resume (openclaw#41847) ACPX: bump bundled acpx to 0.1.16 (openclaw#41975) mattermost: fix DM media upload for unprefixed user IDs (openclaw#29925) fix(msteams): use General channel conversation ID as team key for Bot Framework compatibility (openclaw#41838) fix(mattermost): read replyTo param in plugin handleAction send (openclaw#41176) fix(sandbox): pass real workspace to sessions_spawn when workspaceAccess is ro (openclaw#40757) fix(ui): replace Manual RPC text input with sorted method dropdown (openclaw#14967) CI: select Swift 6.2 toolchain for CodeQL (openclaw#41787) fix(agents): forward memory flush write path (openclaw#41761) fix(telegram): move network fallback to resolver-scoped dispatchers (openclaw#40740) fix(security): harden replaceMarkers() to catch space/underscore boundary marker variants (openclaw#35983) fix(web-search): recover OpenRouter Perplexity citations from message annotations (openclaw#40881) ...
…k runs (openclaw#41711) Merged via squash. Prepared head SHA: 8be8967 Co-authored-by: cgdusek <[email protected]> Co-authored-by: altaywtf <[email protected]> Reviewed-by: @altaywtf
…k runs (openclaw#41711) Merged via squash. Prepared head SHA: 8be8967 Co-authored-by: cgdusek <[email protected]> Co-authored-by: altaywtf <[email protected]> Reviewed-by: @altaywtf
…k runs (openclaw#41711) Merged via squash. Prepared head SHA: 8be8967 Co-authored-by: cgdusek <[email protected]> Co-authored-by: altaywtf <[email protected]> Reviewed-by: @altaywtf
…k runs (openclaw#41711) Merged via squash. Prepared head SHA: 8be8967 Co-authored-by: cgdusek <[email protected]> Co-authored-by: altaywtf <[email protected]> Reviewed-by: @altaywtf
…k runs (openclaw#41711) Merged via squash. Prepared head SHA: 8be8967 Co-authored-by: cgdusek <[email protected]> Co-authored-by: altaywtf <[email protected]> Reviewed-by: @altaywtf
…k runs (openclaw#41711) Merged via squash. Prepared head SHA: 8be8967 Co-authored-by: cgdusek <[email protected]> Co-authored-by: altaywtf <[email protected]> Reviewed-by: @altaywtf
…k runs (openclaw#41711) Merged via squash. Prepared head SHA: 8be8967 Co-authored-by: cgdusek <[email protected]> Co-authored-by: altaywtf <[email protected]> Reviewed-by: @altaywtf
…k runs (openclaw#41711) Merged via squash. Prepared head SHA: 8be8967 Co-authored-by: cgdusek <[email protected]> Co-authored-by: altaywtf <[email protected]> Reviewed-by: @altaywtf
…k runs (openclaw#41711) Merged via squash. Prepared head SHA: 8be8967 Co-authored-by: cgdusek <[email protected]> Co-authored-by: altaywtf <[email protected]> Reviewed-by: @altaywtf
…k runs (openclaw#41711) Merged via squash. Prepared head SHA: 8be8967 Co-authored-by: cgdusek <[email protected]> Co-authored-by: altaywtf <[email protected]> Reviewed-by: @altaywtf (cherry picked from commit 048e25c)
…k runs (openclaw#41711) Merged via squash. Prepared head SHA: 8be8967 Co-authored-by: cgdusek <[email protected]> Co-authored-by: altaywtf <[email protected]> Reviewed-by: @altaywtf
…k runs (openclaw#41711) Merged via squash. Prepared head SHA: 8be8967 Co-authored-by: cgdusek <[email protected]> Co-authored-by: altaywtf <[email protected]> Reviewed-by: @altaywtf
Summary
rate_limit/overloaded/billing) for the first candidate of that providerWhy
In multi-model fallback chains, when all auth profiles for a provider are cooldowned, we currently allow cooldown-bypass probing on every same-provider candidate in one run. If that provider path has long internal retries/timeouts, fallback can spend most of the run repeatedly probing the same provider instead of reaching configured cross-provider candidates.
This patch keeps the first transient cooldown probe (so we still allow model-scoped recovery), but skips duplicate same-provider cooldown probes for the rest of that run.
Changes
src/agents/model-fallback.tsskip_candidatefallback decision with inferred reasonsrc/agents/model-fallback.test.tsCHANGELOG.mdValidation
pnpm test src/agents/model-fallback.test.ts src/agents/model-fallback.probe.test.tspnpm exec vitest run -c vitest.e2e.config.ts src/agents/model-fallback.run-embedded.e2e.test.tspnpm checkpnpm buildRelated