[Bug]: Provider cooldown reports 'rate_limit' regardless of actual failure reason (timeout, billing, etc.) #5240

@Alf-Bee

Summary

When a provider enters cooldown due to a timeout error, subsequent error messages incorrectly report the reason as rate_limit instead of the actual cause (timeout). This makes debugging infrastructure issues (like unreachable Ollama servers or dropped SSH tunnels) very difficult.

Observed Behavior

Error logs show:

21:04:54.568Z: ollama/qwen2.5:32b-instruct-q4_K_M: LLM request timed out. (unknown)
21:04:54.570Z: ollama/...Provider ollama is in cooldown (all profiles unavailable) (rate_limit)

The first message correctly identifies the timeout. But immediately after, when the provider is in cooldown, it reports rate_limit instead of timeout.

Root Cause

In dist/agents/model-fallback.js (line ~165), when a provider is skipped due to cooldown, the reason is hardcoded:

if (profileIds.length > 0 && !isAnyProfileAvailable) {
    attempts.push({
        provider: candidate.provider,
        model: candidate.model,
        error: `Provider ${candidate.provider} is in cooldown (all profiles unavailable)`,
        reason: "rate_limit",  // ← HARDCODED - should use actual reason
    });
    continue;
}

The actual failure reason is stored correctly in auth-profiles.json under usageStats[profileId].failureCounts:

{
  "ollama:local": {
    "failureCounts": {
      "timeout": 1  // ← Correct reason is stored here
    }
  }
}

But this information is not used when reporting why the provider is in cooldown.

Expected Behavior

The error message should report the actual reason that caused the cooldown:

  • If cooldown was caused by timeout → report (timeout)
  • If cooldown was caused by rate_limit → report (rate_limit)
  • If cooldown was caused by billing → report (billing)

Impact

  1. Misleading error messages: users see "rate_limit" when the actual issue is a timeout or connectivity problem.
  2. Difficult debugging: API rate limits cannot be distinguished from infrastructure issues (server down, SSH tunnel dropped, network issues).
  3. Incorrect assumptions: operators might wait for a "rate limit reset" when the actual fix is restarting a service.

Environment

  • OpenClaw version: v0.0.929
  • Provider affected: ollama (but bug affects all providers)
  • Actual cause: Ollama server unreachable (SSH tunnel issues)

Suggested Fix

The cooldown skip logic should read the actual reason from failureCounts and report it:

if (profileIds.length > 0 && !isAnyProfileAvailable) {
    // Pick the most frequent failure reason recorded for the first profile.
    // Note: only profileIds[0] is inspected; other profiles may have failed for other reasons.
    const failureCounts = authStore.usageStats?.[profileIds[0]]?.failureCounts ?? {};
    const [actualReason = "unknown"] = Object.keys(failureCounts).sort(
        (a, b) => (failureCounts[b] ?? 0) - (failureCounts[a] ?? 0),
    );

    attempts.push({
        provider: candidate.provider,
        model: candidate.model,
        error: `Provider ${candidate.provider} is in cooldown (all profiles unavailable)`,
        reason: actualReason,  // ← Use actual reason
    });
    continue;
}
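
A variant of the fix could aggregate failure counts across every profile of the provider instead of reading only the first one. The sketch below is a minimal illustration of that idea; `resolveCooldownReason` is a hypothetical helper name, not something that exists in the codebase, and the shape of `usageStats` is assumed to match the auth-profiles.json excerpt above. If the counts accumulate over time, the most frequent reason may not be the one that triggered the current cooldown, so treat this as a best guess rather than a definitive answer.

```javascript
// Hypothetical helper (not part of OpenClaw): derive a single cooldown reason
// from the failureCounts of all profiles belonging to a provider.
function resolveCooldownReason(usageStats, profileIds) {
    // Sum the failure counts per reason across all given profiles.
    const totals = {};
    for (const id of profileIds) {
        const counts = usageStats?.[id]?.failureCounts ?? {};
        for (const [reason, count] of Object.entries(counts)) {
            totals[reason] = (totals[reason] ?? 0) + (count ?? 0);
        }
    }

    // Pick the reason with the highest total; fall back to "unknown"
    // when nothing has been recorded. On a tie, the first reason seen wins.
    let best = "unknown";
    let bestCount = 0;
    for (const [reason, count] of Object.entries(totals)) {
        if (count > bestCount) {
            best = reason;
            bestCount = count;
        }
    }
    return best;
}

// Usage with the data shape shown in this issue:
const exampleStats = { "ollama:local": { failureCounts: { timeout: 1 } } };
console.log(resolveCooldownReason(exampleStats, ["ollama:local"]));
```

In the skip branch, `reason: resolveCooldownReason(authStore.usageStats, profileIds)` would then replace the hardcoded value.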

Related

The (unknown) categorization for timeout errors (seen in the first log line) may also be worth investigating. Timeouts should be consistently categorized as timeout, not unknown.
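
As a rough illustration of how timeouts could be categorized consistently, the sketch below checks a few signals commonly seen on Node.js errors: the `TimeoutError` name used by `AbortSignal.timeout()`, the `ETIMEDOUT` system error code, undici's timeout codes, and a message match such as the "timed out" text in the first log line. `classifyFailureReason` is a hypothetical name, and how OpenClaw actually classifies errors is not shown in this issue, so this is an assumption about one possible approach.

```javascript
// Hypothetical classifier (not OpenClaw's actual logic): map an error to
// "timeout" when it carries a well-known timeout signal, else "unknown".
const TIMEOUT_CODES = new Set([
    "ETIMEDOUT",               // Node.js system error for timed-out sockets
    "UND_ERR_CONNECT_TIMEOUT", // undici (Node fetch) connect timeout
    "UND_ERR_HEADERS_TIMEOUT", // undici headers timeout
    "UND_ERR_BODY_TIMEOUT",    // undici body timeout
]);

function classifyFailureReason(err) {
    // Walk a short chain of `cause` links, since fetch wraps the underlying error.
    for (let e = err, depth = 0; e && depth < 5; e = e.cause, depth++) {
        if (e.name === "TimeoutError") return "timeout";
        if (TIMEOUT_CODES.has(e.code)) return "timeout";
        if (typeof e.message === "string" && /timed out|timeout/i.test(e.message)) {
            return "timeout";
        }
    }
    return "unknown";
}

// Usage with the message from the first log line:
console.log(classifyFailureReason(new Error("LLM request timed out.")));
```

Other reasons mentioned in this issue (rate_limit, billing) would need their own checks; they are left out here because the signals for them are provider-specific.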
