LLM error messages are over-normalized: raw error details lost in logs #51387

@zwj0117

Description

Problem

When an LLM request fails, formatAssistantErrorText() normalizes many different errors into a single generic message such as "LLM request timed out." The raw error message is never logged, making it impossible to diagnose the actual failure.

Error patterns that all map to "LLM request timed out"

The ERROR_PATTERNS.timeout array matches 15+ patterns:

  • timeout, timed out
  • service unavailable
  • connection error, network error
  • fetch failed, socket hang up
  • ECONNREFUSED, ECONNRESET, ECONNABORTED
  • ETIMEDOUT, ENETUNREACH, EHOSTUNREACH
  • And more...

These represent very different failure modes (real timeout vs. connection refused vs. network error), but users and operators only see "LLM request timed out."
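The collapse described above can be sketched as follows. The pattern list and helper here are an illustrative reconstruction of the behavior, not the actual OpenClaw source:

```typescript
// Illustrative reconstruction: many distinct low-level failures match one
// substring list and collapse into a single user-facing message.
const TIMEOUT_PATTERNS = [
  "timeout", "timed out", "service unavailable", "connection error",
  "network error", "fetch failed", "socket hang up",
  "econnrefused", "econnreset", "econnaborted",
  "etimedout", "enetunreach", "ehostunreach",
];

function formatAssistantErrorText(raw: string): string {
  const lower = raw.toLowerCase();
  if (TIMEOUT_PATTERNS.some((p) => lower.includes(p))) {
    return "LLM request timed out.";
  }
  return "LLM request failed.";
}

// Three very different failures all surface identically:
console.log(formatAssistantErrorText("connect ECONNREFUSED 10.0.0.1:443"));
console.log(formatAssistantErrorText("socket hang up"));
console.log(formatAssistantErrorText("Request timed out after 30000ms"));
```

All three calls above print the same string, which is exactly the diagnosability problem this issue describes.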

Impact

In our deployment using a custom provider (custom-idealab-alibaba-inc-com), we see frequent "LLM request timed out" errors in gateway logs. Some occur within 0.4 seconds of the request starting — clearly not a 30-second timeout. Without the raw error, we cannot determine whether the issue is:

  • An actual timeout
  • A connection reset
  • A DNS failure
  • A TLS error
  • The request being aborted by something else

Where the raw error is lost

In handleAgentEnd(), the error flows through:

  1. lastAssistant.errorMessage (raw) → formatAssistantErrorText() → safeErrorText (normalized)
  2. safeErrorText is what gets logged via consoleMessage
  3. buildApiErrorObservationFields() further redacts the raw error

The raw errorMessage is never emitted to any log output.
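A minimal sketch of how the log line could carry both values, instead of dropping the raw one. `buildAgentEndLogLine` and its exact shape are assumptions for illustration, not OpenClaw's API:

```typescript
// Hypothetical sketch: normalize the user-facing text as today, but keep
// the raw error in the gateway log line rather than discarding it.
function buildAgentEndLogLine(runId: string, rawError: string): string {
  // Stand-in for formatAssistantErrorText(): the generic normalized message.
  const safeErrorText = "LLM request timed out.";
  return `embedded run agent end: runId=${runId} error=${safeErrorText} rawError=${rawError}`;
}

// The operator now sees both the normalized and the original error:
console.warn(buildAgentEndLogLine("abc123", "connect ECONNRESET 10.0.0.1:443"));
```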

Suggested fix

  1. Log the raw error alongside the formatted one — at minimum in debug/warn level:

    embedded run agent end: runId=... error=LLM request timed out. rawError=<original error>
    
  2. Consider differentiating error categories — instead of mapping everything to "timed out", use distinct user-facing messages:

    • "LLM request timed out (no response within Xs)"
    • "LLM request failed: connection error"
    • "LLM request failed: service unavailable"
  3. Make the LLM request timeout configurable per provider in openclaw.json:

    {
      "models": {
        "providers": {
          "my-provider": {
            "requestTimeoutMs": 120000
          }
        }
      }
    }

    Currently the timeout is hardcoded at 30 seconds (3e4 in the GatewayClient constructor).
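Fix 3 could be sketched as a small config lookup with the current 30-second value as the fallback. The interfaces and helper name below are assumptions matching the openclaw.json snippet above, not existing OpenClaw code:

```typescript
// Hypothetical sketch of a per-provider timeout resolved from openclaw.json,
// falling back to the current hardcoded default (3e4 ms).
interface ProviderConfig { requestTimeoutMs?: number }
interface OpenClawConfig {
  models?: { providers?: Record<string, ProviderConfig> };
}

const DEFAULT_REQUEST_TIMEOUT_MS = 30_000; // today's hardcoded value

function resolveRequestTimeoutMs(cfg: OpenClawConfig, provider: string): number {
  return cfg.models?.providers?.[provider]?.requestTimeoutMs
    ?? DEFAULT_REQUEST_TIMEOUT_MS;
}

const cfg: OpenClawConfig = {
  models: { providers: { "my-provider": { requestTimeoutMs: 120000 } } },
};
console.log(resolveRequestTimeoutMs(cfg, "my-provider"));   // configured value
console.log(resolveRequestTimeoutMs(cfg, "other-provider")); // falls back to default
```

The GatewayClient constructor would then take this resolved value instead of the literal 3e4.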

Environment

  • OpenClaw version: 2026.3.13
  • Provider: custom Anthropic-compatible proxy (anthropic-messages API)
  • Model: claude-opus-4-6
  • Channel: DingTalk
