fix(agent): clarify embedded transport errors#51419
fix(agent): clarify embedded transport errors#51419scoootscooob merged 2 commits intoopenclaw:mainfrom
Conversation
Greptile SummaryThis PR improves embedded agent error observability by distinguishing common transport/network failures (ECONNREFUSED, ENOTFOUND, ECONNRESET, socket hang-up, etc.) from the generic Key changes:
Suggestions:
Confidence Score: 4/5
Prompt To Fix All With AIThis is a comment left during a code review.
Path: src/agents/pi-embedded-helpers/errors.ts
Line: 83
Comment:
**`dns` substring match is too broad**
`lower.includes("dns")` is a 3-character bare-substring check with no word boundary. Any error message that contains "dns" as part of a hostname, a JSON key, or an infrastructure path would be misclassified as a DNS resolution failure. For example, an SSL handshake failure from a host like `dns-cache-01.provider.com` would be reported as "DNS lookup for the provider endpoint failed." even though the resolver succeeded.
The other matchers in this block are far more specific: `\benotfound\b`, `\beai_again\b`, `getaddrinfo`, and `no such host` already cover every standard DNS-failure signal. The bare `"dns"` fallback adds false-positive risk without closing a meaningful coverage gap.
Consider removing the `lower.includes("dns")` check, or tightening it to a word-boundary pattern:
```suggestion
/\benotfound\b|\beai_again\b/i.test(raw) ||
lower.includes("getaddrinfo") ||
lower.includes("no such host") ||
/\bdns\b/i.test(raw)
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: src/agents/pi-embedded-helpers/errors.ts
Line: 56-105
Comment:
**Transport errors not classified as failover reasons**
`formatTransportErrorCopy` returns friendly user-facing copy for transport errors (ECONNREFUSED, ENOTFOUND, socket hang up, etc.), but `classifyFailoverReason` is never updated to handle these cases. As a result, these errors will return `null` from `classifyFailoverReason`, meaning no failover attempt is made when a transport failure occurs, even though a retry might succeed (e.g., a transient connection reset or DNS blip).
This is noted as an explicit scope boundary in the PR description ("does not change request timeout configuration"), so this may be intentional. If so, a short comment documenting why transport errors intentionally bypass failover would help future maintainers avoid confusion.
How can I resolve this? If you propose a fix, please make it concise.Last reviewed commit: "Agent: clarify embed..." |
| /\benotfound\b|\beai_again\b/i.test(raw) || | ||
| lower.includes("getaddrinfo") || | ||
| lower.includes("no such host") || | ||
| lower.includes("dns") |
There was a problem hiding this comment.
dns substring match is too broad
lower.includes("dns") is a 3-character bare-substring check with no word boundary. Any error message that contains "dns" as part of a hostname, a JSON key, or an infrastructure path would be misclassified as a DNS resolution failure. For example, an SSL handshake failure from a host like dns-cache-01.provider.com would be reported as "DNS lookup for the provider endpoint failed." even though the resolver succeeded.
The other matchers in this block are far more specific: \benotfound\b, \beai_again\b, getaddrinfo, and no such host already cover every standard DNS-failure signal. The bare "dns" fallback adds false-positive risk without closing a meaningful coverage gap.
Consider removing the lower.includes("dns") check, or tightening it to a word-boundary pattern:
| lower.includes("dns") | |
| /\benotfound\b|\beai_again\b/i.test(raw) || | |
| lower.includes("getaddrinfo") || | |
| lower.includes("no such host") || | |
| /\bdns\b/i.test(raw) |
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/pi-embedded-helpers/errors.ts
Line: 83
Comment:
**`dns` substring match is too broad**
`lower.includes("dns")` is a 3-character bare-substring check with no word boundary. Any error message that contains "dns" as part of a hostname, a JSON key, or an infrastructure path would be misclassified as a DNS resolution failure. For example, an SSL handshake failure from a host like `dns-cache-01.provider.com` would be reported as "DNS lookup for the provider endpoint failed." even though the resolver succeeded.
The other matchers in this block are far more specific: `\benotfound\b`, `\beai_again\b`, `getaddrinfo`, and `no such host` already cover every standard DNS-failure signal. The bare `"dns"` fallback adds false-positive risk without closing a meaningful coverage gap.
Consider removing the `lower.includes("dns")` check, or tightening it to a word-boundary pattern:
```suggestion
/\benotfound\b|\beai_again\b/i.test(raw) ||
lower.includes("getaddrinfo") ||
lower.includes("no such host") ||
/\bdns\b/i.test(raw)
```
How can I resolve this? If you propose a fix, please make it concise.| function formatTransportErrorCopy(raw: string): string | undefined { | ||
| if (!raw) { | ||
| return undefined; | ||
| } | ||
| const lower = raw.toLowerCase(); | ||
|
|
||
| if ( | ||
| /\beconnrefused\b/i.test(raw) || | ||
| lower.includes("connection refused") || | ||
| lower.includes("actively refused") | ||
| ) { | ||
| return "LLM request failed: connection refused by the provider endpoint."; | ||
| } | ||
|
|
||
| if ( | ||
| /\beconnreset\b|\beconnaborted\b|\benetreset\b|\bepipe\b/i.test(raw) || | ||
| lower.includes("socket hang up") || | ||
| lower.includes("connection reset") || | ||
| lower.includes("connection aborted") | ||
| ) { | ||
| return "LLM request failed: network connection was interrupted."; | ||
| } | ||
|
|
||
| if ( | ||
| /\benotfound\b|\beai_again\b/i.test(raw) || | ||
| lower.includes("getaddrinfo") || | ||
| lower.includes("no such host") || | ||
| lower.includes("dns") | ||
| ) { | ||
| return "LLM request failed: DNS lookup for the provider endpoint failed."; | ||
| } | ||
|
|
||
| if ( | ||
| /\benetunreach\b|\behostunreach\b|\behostdown\b/i.test(raw) || | ||
| lower.includes("network is unreachable") || | ||
| lower.includes("host is unreachable") | ||
| ) { | ||
| return "LLM request failed: the provider endpoint is unreachable from this host."; | ||
| } | ||
|
|
||
| if ( | ||
| lower.includes("fetch failed") || | ||
| lower.includes("connection error") || | ||
| lower.includes("network request failed") | ||
| ) { | ||
| return "LLM request failed: network connection error."; | ||
| } | ||
|
|
||
| return undefined; | ||
| } |
There was a problem hiding this comment.
Transport errors not classified as failover reasons
formatTransportErrorCopy returns friendly user-facing copy for transport errors (ECONNREFUSED, ENOTFOUND, socket hang up, etc.), but classifyFailoverReason is never updated to handle these cases. As a result, these errors will return null from classifyFailoverReason, meaning no failover attempt is made when a transport failure occurs, even though a retry might succeed (e.g., a transient connection reset or DNS blip).
This is noted as an explicit scope boundary in the PR description ("does not change request timeout configuration"), so this may be intentional. If so, a short comment documenting why transport errors intentionally bypass failover would help future maintainers avoid confusion.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/pi-embedded-helpers/errors.ts
Line: 56-105
Comment:
**Transport errors not classified as failover reasons**
`formatTransportErrorCopy` returns friendly user-facing copy for transport errors (ECONNREFUSED, ENOTFOUND, socket hang up, etc.), but `classifyFailoverReason` is never updated to handle these cases. As a result, these errors will return `null` from `classifyFailoverReason`, meaning no failover attempt is made when a transport failure occurs, even though a retry might succeed (e.g., a transient connection reset or DNS blip).
This is noted as an explicit scope boundary in the PR description ("does not change request timeout configuration"), so this may be intentional. If so, a short comment documenting why transport errors intentionally bypass failover would help future maintainers avoid confusion.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9521f48b85
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| model: lastAssistant.model, | ||
| provider: lastAssistant.provider, | ||
| ...observedError, | ||
| consoleMessage: `embedded run agent end: runId=${safeRunId} isError=true model=${safeModel} provider=${safeProvider} error=${safeErrorText}`, | ||
| consoleMessage: `embedded run agent end: runId=${safeRunId} isError=true model=${safeModel} provider=${safeProvider} error=${safeErrorText}${rawErrorConsoleSuffix}`, |
There was a problem hiding this comment.
Stop appending raw error previews to consoleMessage
In pretty/compact console mode, consoleMessage is the only text that gets printed (src/logging/subsystem.ts:193-235), so adding rawErrorPreview here widens the console log surface from the previous friendly-only line to raw provider payload text. That preview is not fully redacted by default—src/agents/pi-embedded-error-observation.test.ts:133-145 shows buildApiErrorObservationFields("custom-secret-abc123") preserving the secret until an operator adds a custom pattern—so any provider error that echoes a token, prompt fragment, or other custom secret will now leak to terminals/log shippers that previously never saw it.
Useful? React with 👍 / 👎.
9521f48 to
cea32a4
Compare
🔒 Aisle Security AnalysisWe found 1 potential security issue(s) in this PR:
1. 🟡 Sensitive data exposure via logging rawErrorPreview in consoleMessage
Description
Vulnerable code: const safeRawErrorPreview = sanitizeForConsole(observedError.rawErrorPreview);
const rawErrorConsoleSuffix = safeRawErrorPreview ? ` rawError=${safeRawErrorPreview}` : "";
...
consoleMessage: `... error=${safeErrorText}${rawErrorConsoleSuffix}`,RecommendationAvoid placing raw/provider error previews into the unstructured Safer options:
ctx.log.warn("embedded run agent end", {
...observedError,
consoleMessage: `embedded run agent end: runId=${safeRunId} isError=true model=${safeModel} provider=${safeProvider} error=${safeErrorText}`,
});
const safeRawHash = sanitizeForConsole(observedError.rawErrorHash);
const rawSuffix = safeRawHash ? ` rawErrorHash=${safeRawHash}` : "";
consoleMessage: `... error=${safeErrorText}${rawSuffix}`,
Analyzed PR: #51419 at commit Last updated on: 2026-03-21T05:00:29Z |
|
Merged via squash.
Thanks @scoootscooob! |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cea32a4bda
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if (prefixedCopy) { | ||
| return prefixedCopy; | ||
| } | ||
| const transportCopy = formatTransportErrorCopy(trimmed); | ||
| if (transportCopy) { | ||
| return transportCopy; | ||
| } |
There was a problem hiding this comment.
Rewrite bare transport errors outside the
Error: prefix gate
formatTransportErrorCopy() only runs inside ERROR_PREFIX_RE here, but the main error-text readers (extractAssistantText() in src/agents/pi-embedded-utils.ts:236-251 and src/agents/tools/sessions-helpers.ts:177-199) pass bare assistant content through sanitizeUserFacingText(..., { errorContext: true }). That means an error message whose content is just socket hang up or connect ECONNREFUSED ... still reaches assistant streams/history replies unchanged, so this fix only works for providers that prepend Error:. If the goal is user-facing transport copy across embedded runs, the transport rewrite needs to happen before the prefix check.
Useful? React with 👍 / 👎.
| const transportCopy = formatTransportErrorCopy(raw); | ||
| if (transportCopy) { | ||
| return transportCopy; | ||
| } |
There was a problem hiding this comment.
Preserve HTTP/Cloudflare formatting before DNS transport rewrites
Running formatTransportErrorCopy() before the HTTP/raw-payload formatter means any status text containing dns now short-circuits to DNS lookup for the provider endpoint failed. instead of the existing HTTP-aware message. In src/shared/assistant-error-format.ts:166-168, Cloudflare/HTML 530 responses are intentionally rendered as temporarily unavailable (HTTP 530); after this change, inputs like 530 Origin DNS error never reach that path and lose the upstream HTTP code, which misdiagnoses a provider/CDN outage as a local DNS failure. Restrict the DNS matcher to non-HTTP/raw-payload text or move it after the HTTP formatter.
Useful? React with 👍 / 👎.
Merged via squash. Prepared head SHA: cea32a4 Co-authored-by: scoootscooob <[email protected]> Co-authored-by: scoootscooob <[email protected]> Reviewed-by: @scoootscooob
Merged via squash. Prepared head SHA: cea32a4 Co-authored-by: scoootscooob <[email protected]> Co-authored-by: scoootscooob <[email protected]> Reviewed-by: @scoootscooob
Merged via squash. Prepared head SHA: cea32a4 Co-authored-by: scoootscooob <[email protected]> Co-authored-by: scoootscooob <[email protected]> Reviewed-by: @scoootscooob
Merged via squash. Prepared head SHA: cea32a4 Co-authored-by: scoootscooob <[email protected]> Co-authored-by: scoootscooob <[email protected]> Reviewed-by: @scoootscooob
Summary
Describe the problem and fix in 2-5 bullets:
LLM request timed out.message.Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
User-visible / Behavior Changes
LLM request timed out.rawErrorPreviewfor operator diagnosis.Security Impact (required)
No)No)No)No)No)Yes, explain risk + mitigation:Repro + Verification
Environment
Steps
formatAssistantErrorText()with transport-style failures such asECONNREFUSED,ENOTFOUND, andsocket hang up.Expected
Actual
LLM request timed out.and the console line omitted the redacted raw preview.Evidence
Attach at least one:
Human Verification (required)
What you personally verified (not just CI), and how:
pnpm test -- src/agents/pi-embedded-helpers.formatassistanterrortext.test.tspnpm test -- src/agents/pi-embedded-helpers.sanitizeuserfacingtext.test.tspnpm test -- src/agents/pi-embedded-subscribe.handlers.lifecycle.test.tspnpm test -- extensions/discord/src/monitor/provider.lifecycle.test.tsReview Conversations
If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.
Compatibility / Migration
Yes)No)No)Failure Recovery (if this breaks)
src/agents/pi-embedded-helpers/errors.tssrc/agents/pi-embedded-subscribe.handlers.lifecycle.tsRisks and Mitigations
List only real risks for this PR. Add/remove entries as needed. If none, write
None.