
fix(telegram): use sticky IPv4 fallback on dual-stack failures#40435

Closed
sircrumpet wants to merge 4 commits into openclaw:main from sircrumpet:telegram/telegram-ipv4-sticky-fallback

Conversation


@sircrumpet sircrumpet commented Mar 9, 2026

Summary


  • Problem: In affected dual-stack environments, Telegram could receive updates but fail to send replies (sendMessage / sendChatAction network failures), effectively breaking agent responses.
  • Why it matters: This is user-facing breakage: the bot appears online but cannot reliably respond.

A workaround exists (and has been suggested in related issues): setting channels.telegram.network to

network: {
	autoSelectFamily: false,
	dnsResultOrder: "ipv4first",
}

biases Telegram toward IPv4 and avoids Node’s auto-select behavior.

This PR handles the failure path automatically: when Telegram hits qualifying dual-stack connect errors, it falls back to IPv4 and stays there for that resolver lifecycle. So users who don’t know they need a network override still recover, and users with custom dispatchers/proxy routing are protected because fallback no longer overwrites caller-provided dispatchers.
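The retry flow described above can be sketched as follows. This is a minimal illustration under stated assumptions; the function names, error codes, and types here are hypothetical and do not match the actual implementation in src/telegram/fetch.ts:

```typescript
// Hypothetical sketch of the sticky IPv4 fallback flow; names are
// illustrative and do not match src/telegram/fetch.ts exactly.
type FetchLike = (url: string, init?: { dispatcher?: unknown }) => Promise<string>;

// Decide whether an error looks like a qualifying dual-stack connect failure.
// (The real predicate inspects Node/undici error details; this is a stand-in.)
function isQualifyingNetworkError(err: unknown): boolean {
  const code = (err as { code?: string } | null)?.code;
  return code === "ENETUNREACH" || code === "EHOSTUNREACH" || code === "ETIMEDOUT";
}

// Wrap a fetch so that after one qualifying failure, every later request
// through this resolver goes straight to the IPv4-only dispatcher.
function makeStickyResolver(fetchImpl: FetchLike, ipv4Dispatcher: unknown): FetchLike {
  let stickyIpv4 = false; // per-resolver: lives in this closure
  return async (url, init = {}) => {
    // Never override a dispatcher the caller supplied explicitly.
    const withIpv4 = () =>
      init.dispatcher ? init : { ...init, dispatcher: ipv4Dispatcher };
    if (stickyIpv4) return fetchImpl(url, withIpv4());
    try {
      return await fetchImpl(url, init);
    } catch (err) {
      if (!isQualifyingNetworkError(err)) throw err;
      stickyIpv4 = true; // stick for the rest of this resolver's lifecycle
      return fetchImpl(url, withIpv4());
    }
  };
}
```

Note that the sticky flag short-circuits straight to the IPv4 dispatcher, so the fail-then-retry cost is paid once per resolver rather than once per request.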

  • What changed:
    • Made IPv4 fallback sticky per Telegram fetch resolver after first qualifying connect failure.
    • Switched fallback to a request-scoped IPv4 dispatcher (family: 4) instead of repeatedly mutating transport behavior.
    • Preserved caller-provided init.dispatcher during fallback retry.
    • Refined proxy/global-dispatcher handling so custom proxy dispatchers are preserved while EnvHttpProxyAgent can still be refreshed with Telegram connect options.
  • What did NOT change (scope boundary):
    • No new config keys/env vars.
    • No Discord/Slack transport changes.
    • No API/schema changes.
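The proxy/global-dispatcher refinement can be sketched as a small classification step. Hypothetical names and checks; the real resolveGlobalDispatcherKind may decide differently:

```typescript
// Hypothetical sketch of the dispatcher classification used during fallback;
// the real resolveGlobalDispatcherKind in src/telegram/fetch.ts may differ.
type DispatcherKind = "env-proxy" | "custom-proxy" | "default";

function classifyGlobalDispatcher(d: object | undefined): DispatcherKind {
  if (!d) return "default";
  const name = d.constructor?.name ?? "";
  // An EnvHttpProxyAgent was derived from environment variables, so it is
  // safe to rebuild with Telegram's connect options (e.g. family: 4).
  if (name === "EnvHttpProxyAgent") return "env-proxy";
  // Any other proxy-like dispatcher was configured deliberately by the
  // caller and must be preserved untouched.
  if (name.includes("Proxy")) return "custom-proxy";
  return "default";
}
```

The point of the split is that only the environment-derived agent is refreshed; deliberately configured proxy routing is never replaced.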

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • In affected environments, Telegram outbound delivery recovers by sticking to IPv4 after the first qualifying dual-stack failure.
  • Repeated fallback warnings are significantly reduced because fallback is no longer retried in the same way on each request.
  • Retry path no longer overwrites explicit caller-provided dispatchers.

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (No)
  • Command/tool execution surface changed? (No)
  • Data access scope changed? (No)
  • If any Yes, explain risk + mitigation:
    • N/A

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: Node v25.5.0
  • Model/provider: N/A (transport-layer issue); tested with Claude Sonnet + GPT 5.4 via Codex
  • Integration/channel (if any): Telegram
  • Relevant config (redacted): Telegram enabled across multiple accounts; no special network override required to trigger in the affected environment (channels.telegram.network.autoSelectFamily not set).

Steps

  1. Start gateway in an environment with unstable IPv6/dual-stack reachability to Telegram.
  2. Send a Telegram message that requires bot reply.
  3. Observe outbound reply behavior and telegram/network logs.

Expected

  • Bot replies successfully to Telegram messages.
  • If fallback is needed, it activates once and remains stable for subsequent Telegram requests in that resolver lifecycle.

Actual

  • Before fix: repeated transport fallback churn and outbound failures (sendMessage / sendChatAction failed).
  • After fix: stable retry path with sticky IPv4 fallback and preserved per-request dispatchers.

Evidence


Before (user logs, broken outbound replies):

07:44:18 warn telegram/network ... fetch fallback: forcing autoSelectFamily=false + dnsResultOrder=ipv4first
07:44:20 error gateway/channels/telegram ... telegram sendMessage failed: Network request for 'sendMessage' failed!
07:44:20 error gateway/channels/telegram ... telegram final reply failed: HttpError: Network request for 'sendMessage' failed!
08:01:27 warn telegram/network ... fetch fallback: retrying with IPv4-only dispatcher
08:01:27 warn telegram/network ... fetch fallback: retrying with IPv4-only dispatcher

After (tests/build on this branch):

✓ src/telegram/fetch.test.ts (18 tests) 16ms
Test Files  1 passed (1)
Tests       18 passed (18)
pnpm build
... completed successfully

Added/updated tests in src/telegram/fetch.test.ts cover:

  • sticky IPv4 fallback behavior,
  • EnvHttpProxyAgent refresh behavior,
  • preserving caller-provided dispatcher during fallback retry.

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios:
    • Targeted Telegram fetch tests pass with new sticky fallback + dispatcher preservation behavior.
    • Build passes.
    • Repro logs align with previously failing behavior and fallback churn addressed by this patch.
  • Edge cases checked:
    • Existing EnvHttpProxyAgent gets refreshed to apply connect options.
    • Existing custom proxy-like global dispatcher remains preservable.
    • Explicit request dispatcher is not overwritten on retry.
  • What you did not verify:
    • Full live matrix across all proxy topologies and Node minor versions.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.


Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (No)
  • Migration needed? (No)
  • If yes, exact upgrade steps:
    • N/A

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly:
    • Revert this PR's commits.
  • Files/config to restore:
    • src/telegram/fetch.ts
    • src/telegram/fetch.test.ts
  • Known bad symptoms reviewers should watch for:
    • Telegram outbound replies failing again under dual-stack network errors.
    • Reappearance of repeated fallback warning spam.
    • Retry path overriding caller-supplied dispatcher.

Risks and Mitigations

  • Risk: Sticky fallback may keep the IPv4 path longer than strictly necessary after transient network recovery.
    • Mitigation: Scope is resolver-local, and fallback only activates after qualifying network errors.
  • Risk: Proxy routing regression if fallback overrides a custom dispatcher.
    • Mitigation: Fallback now explicitly preserves caller-provided init.dispatcher.

AI-assisted: Codex for Investigation, review of git history / related items, root cause analysis
Testing level: Targeted unit tests + build; reproduced the failure before applying the branch and verified resolution after.

@openclaw-barnacle openclaw-barnacle bot added channel: telegram Channel integration: telegram size: S labels Mar 9, 2026

greptile-apps bot commented Mar 9, 2026

Greptile Summary

This PR fixes a user-facing Telegram bot breakage in dual-stack environments where outbound sendMessage/sendChatAction calls failed with IPv6 network errors. The fix replaces the old approach of mutating global transport state on each fallback with a cleaner per-request IPv4-only dispatcher that is lazily created once and cached at module scope.

Key changes:

  • resolveGlobalDispatcherKind replaces isProxyLikeDispatcher, enabling EnvHttpProxyAgent instances to be refreshed while custom ProxyAgent dispatchers are preserved.
  • ipv4FallbackDispatcher is a module-level cached EnvHttpProxyAgent({ connect: { family: 4 } }) injected on fallback via buildIpv4FallbackInit.
  • stickyIpv4FallbackEnabled (closure variable scoped per resolver) ensures that once a qualifying dual-stack failure is observed, all subsequent requests through that resolver use the IPv4 dispatcher proactively — eliminating repeated failure/retry churn.
  • Caller-supplied init.dispatcher values are preserved and never overridden by the fallback path.
  • Transport layer only; no API or schema impact. Well-targeted test coverage includes sticky fallback behavior, dispatcher preservation, and EnvHttpProxyAgent refresh scenarios.

Note on design: stickyIpv4FallbackEnabled is scoped per resolver instance, while ipv4FallbackDispatcher is shared at module scope. A new resolver created after a failure will reset sticky to false, allowing one additional wasted attempt. This represents a design trade-off—per-resolver scoping allows recovery if configuration changes, but means the "sticky" guarantee applies per-resolver, not globally.

Confidence Score: 4/5

  • Safe to merge; transport-layer fix with no API or schema impact and well-targeted test coverage.
  • The implementation correctly addresses the root cause (per-request IPv4-only dispatcher replacing global state mutation) and preserves caller-provided dispatchers. Two minor issues identified: (1) misleading inline comment about dispatcher reuse, and (2) design asymmetry where sticky flag resets per resolver but dispatcher is module-level. Neither is a correctness bug—the per-resolver scoping is an intentional design trade-off mentioned in the PR description.
  • src/telegram/fetch.ts — specifically the misleading comment on line 201 and the per-resolver vs per-module scoping of the sticky flag and dispatcher.

Comments Outside Diff (1)

  1. src/telegram/fetch.ts, line 234-251 (link)

    stickyIpv4FallbackEnabled is scoped to the closure returned by resolveTelegramFetch(). If a new resolver instance is created after this flag has been triggered (e.g., on config reload), the new instance starts with stickyIpv4FallbackEnabled = false, allowing the first request through the new resolver to attempt the non-IPv4-only path and potentially fail again.

    While ipv4FallbackDispatcher is correctly cached at module scope and reused across resolvers, the per-closure flag means the sticky activation logic resets on each new resolver instance. This slightly weakens the "sticky" guarantee — one additional wasted attempt is possible after resolver recreation.

    Consider promoting stickyIpv4FallbackEnabled to module scope alongside ipv4FallbackDispatcher to achieve true sticky behavior across the process lifetime, or add an explicit code comment documenting that callers must retain the same resolver instance for stickiness to remain fully effective.


Last reviewed commit: 10d239a

@sircrumpet
Contributor Author

@greptile-apps Updated code comment for clarity.

Thanks, regarding the sticky flag, this was considered and the current behavior is intentional.
stickyIpv4FallbackEnabled is scoped to the resolver closure so stickiness follows that resolver’s lifecycle, while the IPv4 dispatcher instance is shared for reuse efficiency.
We avoided process-global sticky state because it can over-pin IPv4 for the full process lifetime after a transient dual-stack issue. With per-resolver stickiness, a recreated resolver gets one fresh probe of the default path, which is a deliberate tradeoff to allow recovery when network conditions/config change. The potential extra fallback after resolver recreation is acceptable versus forcing permanent IPv4 bias process-wide.
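This scoping can be illustrated with a minimal sketch, assuming hypothetical names (not the real module's):

```typescript
// Illustrative sketch of the scoping tradeoff: the IPv4 dispatcher is cached
// once at module scope, while stickiness lives in each resolver's closure.
let cachedIpv4Dispatcher: { family: number } | undefined;

function getIpv4Dispatcher(): { family: number } {
  // Lazily created once, then reused by every resolver in the process.
  cachedIpv4Dispatcher ??= { family: 4 };
  return cachedIpv4Dispatcher;
}

function createResolver() {
  let sticky = false; // resets with each new resolver instance
  return {
    noteQualifyingFailure: () => { sticky = true; },
    isSticky: () => sticky,
    ipv4Dispatcher: getIpv4Dispatcher,
  };
}
```

A recreated resolver therefore pays at most one extra probe of the default path, while the dispatcher object itself is never rebuilt.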

@greptile-apps
Contributor

greptile-apps bot commented Mar 9, 2026

Got it, that makes sense. Per-resolver stickiness gives you the recovery opportunity on recreate while avoiding permanent IPv4 lock-in process-wide. The tradeoff is reasonable - better to probe once per resolver lifecycle than stay stuck on IPv4 after a transient blip.

Thanks for clarifying the comment.

@sircrumpet
Contributor Author

Closing this as superseded by #40740.

This PR improved fallback behavior (sticky IPv4 + preserving caller dispatcher), but it still kept Telegram’s global transport mutation path in src/telegram/fetch.ts (process-level net/dns/global dispatcher interaction).

The replacement PR keeps the same user-facing goal (recover from dual-stack failures) while moving to resolver-scoped transport policy:

  • no Telegram process-global network toggling,
  • per-resolver dispatcher policy (normal + sticky IPv4 fallback),
  • preserved caller-provided dispatcher on retry,
  • tests for multi-resolver isolation and sticky fallback behavior.

@sircrumpet sircrumpet closed this Mar 9, 2026

Development

Successfully merging this pull request may close these issues.

[Bug]: Telegram inbound messages not reaching agent after v2026.2.19-2 update
