Skip to content

fix(agent): clarify embedded transport errors#51419

Merged
scoootscooob merged 2 commits intoopenclaw:mainfrom
scoootscooob:codex/clarify-embedded-transport-errors
Mar 21, 2026
Merged

fix(agent): clarify embedded transport errors#51419
scoootscooob merged 2 commits intoopenclaw:mainfrom
scoootscooob:codex/clarify-embedded-transport-errors

Conversation

@scoootscooob
Copy link
Copy Markdown
Contributor

Summary

Describe the problem and fix in 2-5 bullets:

  • Problem: embedded agent failures were collapsing many transport/network errors into the same LLM request timed out. message.
  • Why it matters: operators could not distinguish connection refused, DNS failures, and interrupted sockets from real timeouts, even in lifecycle logs.
  • What changed: added safe transport-specific error copy for common network failures and included the already-redacted raw error preview in the lifecycle console message.
  • What did NOT change (scope boundary): this does not expose raw provider errors to chat, change request timeout configuration, or widen any logging surface beyond the existing redacted observation fields.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • Embedded agent failures now distinguish common transport failures such as connection refused, DNS lookup failures, interrupted sockets, and generic network errors instead of flattening them all to LLM request timed out.
  • Lifecycle console logging now appends the existing redacted rawErrorPreview for operator diagnosis.

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (No)
  • Command/tool execution surface changed? (No)
  • Data access scope changed? (No)
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: Node 22 / pnpm test wrapper
  • Model/provider: synthetic embedded-agent error fixtures
  • Integration/channel (if any): embedded agent lifecycle logging
  • Relevant config (redacted): none required

Steps

  1. Reproduce formatAssistantErrorText() with transport-style failures such as ECONNREFUSED, ENOTFOUND, and socket hang up.
  2. Run the embedded lifecycle handler with an assistant error and inspect the emitted lifecycle metadata.
  3. Run targeted tests covering formatter, sanitization, and lifecycle logging.

Expected

  • Connection-refused, DNS, and interrupted-connection failures get distinct safe copy.
  • Lifecycle console output preserves the redacted raw preview for diagnosis.

Actual

  • Before this patch, the same failures were normalized to LLM request timed out. and the console line omitted the redacted raw preview.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios:
    • pnpm test -- src/agents/pi-embedded-helpers.formatassistanterrortext.test.ts
    • pnpm test -- src/agents/pi-embedded-helpers.sanitizeuserfacingtext.test.ts
    • pnpm test -- src/agents/pi-embedded-subscribe.handlers.lifecycle.test.ts
    • pnpm test -- extensions/discord/src/monitor/provider.lifecycle.test.ts
  • Edge cases checked:
    • rate limit and overload paths still take precedence over transport classification
    • lifecycle console output stays sanitized while appending redacted raw previews
  • What you did not verify:
    • full repo test suite
    • live provider traffic against a real external endpoint

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.

Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (No)
  • Migration needed? (No)
  • If yes, exact upgrade steps:

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: revert commit 9521f48
  • Files/config to restore:
    • src/agents/pi-embedded-helpers/errors.ts
    • src/agents/pi-embedded-subscribe.handlers.lifecycle.ts
  • Known bad symptoms reviewers should watch for:
    • transport failures unexpectedly falling through to generic raw HTTP formatting
    • lifecycle console messages including unsanitized content

Risks and Mitigations

List only real risks for this PR. Add/remove entries as needed. If none, write None.

  • Risk: a transport-specific matcher could accidentally steal a more specific classification.
    • Mitigation: transport classification runs after rate-limit/overload classification, and targeted tests cover the new branches.

@openclaw-barnacle openclaw-barnacle bot added agents Agent runtime and tooling size: S maintainer Maintainer-authored PR labels Mar 21, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 21, 2026

Greptile Summary

This PR improves embedded agent error observability by distinguishing common transport/network failures (ECONNREFUSED, ENOTFOUND, ECONNRESET, socket hang-up, etc.) from the generic "LLM request timed out." message they previously collapsed into, and by appending a sanitized raw error preview to the lifecycle console log line.

Key changes:

  • New formatTransportErrorCopy function in errors.ts maps POSIX network errno codes and related string patterns to safe, operator-readable copy; it is inserted in the classification pipeline after rate-limit/overload checks and before the existing timeout check.
  • handleAgentEnd in the lifecycle handler now appends rawError=<redacted-preview> to the console message when a raw preview is available.
  • Three new formatAssistantErrorText tests and one new sanitizeUserFacingText test cover the new branches; lifecycle tests are updated to reflect the expanded message format.
  • Scope is correctly bounded: no raw provider errors are exposed to chat, no request timeout configuration is changed, and the logging surface is not widened beyond the existing redacted observation fields.

Suggestions:

  • lower.includes("dns") (line 83 of errors.ts) is an unanchored 3-character substring that can produce false positives when a hostname, JSON key, or other field in an error message happens to contain "dns". The surrounding matchers (ENOTFOUND, EAI_AGAIN, getaddrinfo, no such host) already cover all standard DNS failure signals; the bare string adds fragility. Consider replacing it with /\bdns\b/i or removing it.
  • Transport errors are intentionally excluded from classifyFailoverReason, meaning no failover is attempted on ECONNREFUSED/ENOTFOUND/etc. This is called out as an explicit scope boundary in the PR, but a brief comment in the code would help future maintainers avoid confusion.

Confidence Score: 4/5

  • Safe to merge with one minor false-positive risk in the DNS substring match worth addressing before or shortly after merge.
  • The core change is straightforward and well-tested: transport error classification is layered in after existing, higher-priority checks, and the lifecycle console change only appends an already-sanitized field. The one issue worth noting is lower.includes("dns") — a 3-character bare substring that could misclassify non-DNS errors containing that sequence. Everything else, including sanitization, ordering of checks, and scope boundary adherence, looks correct.
  • src/agents/pi-embedded-helpers/errors.ts — specifically the lower.includes("dns") check in formatTransportErrorCopy.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/agents/pi-embedded-helpers/errors.ts
Line: 83

Comment:
**`dns` substring match is too broad**

`lower.includes("dns")` is a 3-character bare-substring check with no word boundary. Any error message that contains "dns" as part of a hostname, a JSON key, or an infrastructure path would be misclassified as a DNS resolution failure. For example, an SSL handshake failure from a host like `dns-cache-01.provider.com` would be reported as "DNS lookup for the provider endpoint failed." even though the resolver succeeded.

The other matchers in this block are far more specific: `\benotfound\b`, `\beai_again\b`, `getaddrinfo`, and `no such host` already cover every standard DNS-failure signal. The bare `"dns"` fallback adds false-positive risk without closing a meaningful coverage gap.

Consider removing the `lower.includes("dns")` check, or tightening it to a word-boundary pattern:

```suggestion
    /\benotfound\b|\beai_again\b/i.test(raw) ||
    lower.includes("getaddrinfo") ||
    lower.includes("no such host") ||
    /\bdns\b/i.test(raw)
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/agents/pi-embedded-helpers/errors.ts
Line: 56-105

Comment:
**Transport errors not classified as failover reasons**

`formatTransportErrorCopy` returns friendly user-facing copy for transport errors (ECONNREFUSED, ENOTFOUND, socket hang up, etc.), but `classifyFailoverReason` is never updated to handle these cases. As a result, these errors will return `null` from `classifyFailoverReason`, meaning no failover attempt is made when a transport failure occurs, even though a retry might succeed (e.g., a transient connection reset or DNS blip).

This is noted as an explicit scope boundary in the PR description ("does not change request timeout configuration"), so this may be intentional. If so, a short comment documenting why transport errors intentionally bypass failover would help future maintainers avoid confusion.

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: "Agent: clarify embed..."

/\benotfound\b|\beai_again\b/i.test(raw) ||
lower.includes("getaddrinfo") ||
lower.includes("no such host") ||
lower.includes("dns")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 dns substring match is too broad

lower.includes("dns") is a 3-character bare-substring check with no word boundary. Any error message that contains "dns" as part of a hostname, a JSON key, or an infrastructure path would be misclassified as a DNS resolution failure. For example, an SSL handshake failure from a host like dns-cache-01.provider.com would be reported as "DNS lookup for the provider endpoint failed." even though the resolver succeeded.

The other matchers in this block are far more specific: \benotfound\b, \beai_again\b, getaddrinfo, and no such host already cover every standard DNS-failure signal. The bare "dns" fallback adds false-positive risk without closing a meaningful coverage gap.

Consider removing the lower.includes("dns") check, or tightening it to a word-boundary pattern:

Suggested change
lower.includes("dns")
/\benotfound\b|\beai_again\b/i.test(raw) ||
lower.includes("getaddrinfo") ||
lower.includes("no such host") ||
/\bdns\b/i.test(raw)
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/pi-embedded-helpers/errors.ts
Line: 83

Comment:
**`dns` substring match is too broad**

`lower.includes("dns")` is a 3-character bare-substring check with no word boundary. Any error message that contains "dns" as part of a hostname, a JSON key, or an infrastructure path would be misclassified as a DNS resolution failure. For example, an SSL handshake failure from a host like `dns-cache-01.provider.com` would be reported as "DNS lookup for the provider endpoint failed." even though the resolver succeeded.

The other matchers in this block are far more specific: `\benotfound\b`, `\beai_again\b`, `getaddrinfo`, and `no such host` already cover every standard DNS-failure signal. The bare `"dns"` fallback adds false-positive risk without closing a meaningful coverage gap.

Consider removing the `lower.includes("dns")` check, or tightening it to a word-boundary pattern:

```suggestion
    /\benotfound\b|\beai_again\b/i.test(raw) ||
    lower.includes("getaddrinfo") ||
    lower.includes("no such host") ||
    /\bdns\b/i.test(raw)
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +56 to +105
function formatTransportErrorCopy(raw: string): string | undefined {
if (!raw) {
return undefined;
}
const lower = raw.toLowerCase();

if (
/\beconnrefused\b/i.test(raw) ||
lower.includes("connection refused") ||
lower.includes("actively refused")
) {
return "LLM request failed: connection refused by the provider endpoint.";
}

if (
/\beconnreset\b|\beconnaborted\b|\benetreset\b|\bepipe\b/i.test(raw) ||
lower.includes("socket hang up") ||
lower.includes("connection reset") ||
lower.includes("connection aborted")
) {
return "LLM request failed: network connection was interrupted.";
}

if (
/\benotfound\b|\beai_again\b/i.test(raw) ||
lower.includes("getaddrinfo") ||
lower.includes("no such host") ||
lower.includes("dns")
) {
return "LLM request failed: DNS lookup for the provider endpoint failed.";
}

if (
/\benetunreach\b|\behostunreach\b|\behostdown\b/i.test(raw) ||
lower.includes("network is unreachable") ||
lower.includes("host is unreachable")
) {
return "LLM request failed: the provider endpoint is unreachable from this host.";
}

if (
lower.includes("fetch failed") ||
lower.includes("connection error") ||
lower.includes("network request failed")
) {
return "LLM request failed: network connection error.";
}

return undefined;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Transport errors not classified as failover reasons

formatTransportErrorCopy returns friendly user-facing copy for transport errors (ECONNREFUSED, ENOTFOUND, socket hang up, etc.), but classifyFailoverReason is never updated to handle these cases. As a result, these errors will return null from classifyFailoverReason, meaning no failover attempt is made when a transport failure occurs, even though a retry might succeed (e.g., a transient connection reset or DNS blip).

This is noted as an explicit scope boundary in the PR description ("does not change request timeout configuration"), so this may be intentional. If so, a short comment documenting why transport errors intentionally bypass failover would help future maintainers avoid confusion.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/pi-embedded-helpers/errors.ts
Line: 56-105

Comment:
**Transport errors not classified as failover reasons**

`formatTransportErrorCopy` returns friendly user-facing copy for transport errors (ECONNREFUSED, ENOTFOUND, socket hang up, etc.), but `classifyFailoverReason` is never updated to handle these cases. As a result, these errors will return `null` from `classifyFailoverReason`, meaning no failover attempt is made when a transport failure occurs, even though a retry might succeed (e.g., a transient connection reset or DNS blip).

This is noted as an explicit scope boundary in the PR description ("does not change request timeout configuration"), so this may be intentional. If so, a short comment documenting why transport errors intentionally bypass failover would help future maintainers avoid confusion.

How can I resolve this? If you propose a fix, please make it concise.

@scoootscooob scoootscooob self-assigned this Mar 21, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9521f48b85

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines 62 to +65
model: lastAssistant.model,
provider: lastAssistant.provider,
...observedError,
consoleMessage: `embedded run agent end: runId=${safeRunId} isError=true model=${safeModel} provider=${safeProvider} error=${safeErrorText}`,
consoleMessage: `embedded run agent end: runId=${safeRunId} isError=true model=${safeModel} provider=${safeProvider} error=${safeErrorText}${rawErrorConsoleSuffix}`,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Stop appending raw error previews to consoleMessage

In pretty/compact console mode, consoleMessage is the only text that gets printed (src/logging/subsystem.ts:193-235), so adding rawErrorPreview here widens the console log surface from the previous friendly-only line to raw provider payload text. That preview is not fully redacted by default—src/agents/pi-embedded-error-observation.test.ts:133-145 shows buildApiErrorObservationFields("custom-secret-abc123") preserving the secret until an operator adds a custom pattern—so any provider error that echoes a token, prompt fragment, or other custom secret will now leak to terminals/log shippers that previously never saw it.

Useful? React with 👍 / 👎.

@scoootscooob scoootscooob force-pushed the codex/clarify-embedded-transport-errors branch from 9521f48 to cea32a4 Compare March 21, 2026 04:47
@aisle-research-bot
Copy link
Copy Markdown

aisle-research-bot bot commented Mar 21, 2026

🔒 Aisle Security Analysis

We found 1 potential security issue(s) in this PR:

# Severity Title
1 🟡 Medium Sensitive data exposure via logging rawErrorPreview in consoleMessage

1. 🟡 Sensitive data exposure via logging rawErrorPreview in consoleMessage

Property Value
Severity Medium
CWE CWE-532
Location src/agents/pi-embedded-subscribe.handlers.lifecycle.ts:53-66

Description

handleAgentEnd appends observedError.rawErrorPreview into consoleMessage, which is the string emitted to console logs.

  • observedError.rawErrorPreview is derived from lastAssistant.errorMessage (provider/upstream error text).
  • Although buildApiErrorObservationFields() applies truncation and some redaction patterns, it is not a strict allowlist and may still include sensitive data (e.g., provider URLs, IPs, request metadata, user content echoed by upstream, or secrets not covered by patterns).
  • Embedding this preview into the unstructured console message can increase exposure:
    • it will be shown in pretty/compact console output (where structured meta fields are otherwise not displayed)
    • it can bypass downstream log-pipeline policies that redact/limit access to specific structured fields (because the content is now inside the message string)

Vulnerable code:

const safeRawErrorPreview = sanitizeForConsole(observedError.rawErrorPreview);
const rawErrorConsoleSuffix = safeRawErrorPreview ? ` rawError=${safeRawErrorPreview}` : "";
...
consoleMessage: `... error=${safeErrorText}${rawErrorConsoleSuffix}`,

Recommendation

Avoid placing raw/provider error previews into the unstructured consoleMessage. Keep potentially-sensitive diagnostics in structured fields (which can be access-controlled/redacted downstream), or log only a stable hash/fingerprint.

Safer options:

  1. Remove the raw error from consoleMessage (recommended):
ctx.log.warn("embedded run agent end", {
  ...observedError,
  consoleMessage: `embedded run agent end: runId=${safeRunId} isError=true model=${safeModel} provider=${safeProvider} error=${safeErrorText}`,
});
  1. If correlation is needed, log only a hash/fingerprint in consoleMessage:
const safeRawHash = sanitizeForConsole(observedError.rawErrorHash);
const rawSuffix = safeRawHash ? ` rawErrorHash=${safeRawHash}` : "";
consoleMessage: `... error=${safeErrorText}${rawSuffix}`,
  1. Gate raw previews behind an explicit debug/verbose flag, and consider additional allowlist-based redaction if you must emit it.

Analyzed PR: #51419 at commit cea32a4

Last updated on: 2026-03-21T05:00:29Z

@scoootscooob scoootscooob merged commit d78e13f into openclaw:main Mar 21, 2026
11 checks passed
@scoootscooob
Copy link
Copy Markdown
Contributor Author

Merged via squash.

Thanks @scoootscooob!

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cea32a4bda

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines 682 to +688
if (prefixedCopy) {
return prefixedCopy;
}
const transportCopy = formatTransportErrorCopy(trimmed);
if (transportCopy) {
return transportCopy;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Rewrite bare transport errors outside the Error: prefix gate

formatTransportErrorCopy() only runs inside ERROR_PREFIX_RE here, but the main error-text readers (extractAssistantText() in src/agents/pi-embedded-utils.ts:236-251 and src/agents/tools/sessions-helpers.ts:177-199) pass bare assistant content through sanitizeUserFacingText(..., { errorContext: true }). That means an error message whose content is just socket hang up or connect ECONNREFUSED ... still reaches assistant streams/history replies unchanged, so this fix only works for providers that prepend Error:. If the goal is user-facing transport copy across embedded runs, the transport rewrite needs to happen before the prefix check.

Useful? React with 👍 / 👎.

Comment on lines +620 to +623
const transportCopy = formatTransportErrorCopy(raw);
if (transportCopy) {
return transportCopy;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve HTTP/Cloudflare formatting before DNS transport rewrites

Running formatTransportErrorCopy() before the HTTP/raw-payload formatter means any status text containing dns now short-circuits to DNS lookup for the provider endpoint failed. instead of the existing HTTP-aware message. In src/shared/assistant-error-format.ts:166-168, Cloudflare/HTML 530 responses are intentionally rendered as temporarily unavailable (HTTP 530); after this change, inputs like 530 Origin DNS error never reach that path and lose the upstream HTTP code, which misdiagnoses a provider/CDN outage as a local DNS failure. Restrict the DNS matcher to non-HTTP/raw-payload text or move it after the HTTP formatter.

Useful? React with 👍 / 👎.

JohnJAS pushed a commit to JohnJAS/openclaw that referenced this pull request Mar 22, 2026
Merged via squash.

Prepared head SHA: cea32a4
Co-authored-by: scoootscooob <[email protected]>
Co-authored-by: scoootscooob <[email protected]>
Reviewed-by: @scoootscooob
pholpaphankorn pushed a commit to pholpaphankorn/openclaw that referenced this pull request Mar 22, 2026
Merged via squash.

Prepared head SHA: cea32a4
Co-authored-by: scoootscooob <[email protected]>
Co-authored-by: scoootscooob <[email protected]>
Reviewed-by: @scoootscooob
frankekn pushed a commit to artwalker/openclaw that referenced this pull request Mar 23, 2026
Merged via squash.

Prepared head SHA: cea32a4
Co-authored-by: scoootscooob <[email protected]>
Co-authored-by: scoootscooob <[email protected]>
Reviewed-by: @scoootscooob
furaul pushed a commit to furaul/openclaw that referenced this pull request Mar 24, 2026
Merged via squash.

Prepared head SHA: cea32a4
Co-authored-by: scoootscooob <[email protected]>
Co-authored-by: scoootscooob <[email protected]>
Reviewed-by: @scoootscooob
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling maintainer Maintainer-authored PR size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

LLM error messages are over-normalized: raw error details lost in logs

1 participant