Skip to content

Comments

fix(gateway): improve crash resilience for mDNS and network errors#4653

Open
AyedAlmudarra wants to merge 3 commits intoopenclaw:mainfrom
AyedAlmudarra:fix/gateway-crash-resilience
Open

fix(gateway): improve crash resilience for mDNS and network errors#4653
AyedAlmudarra wants to merge 3 commits intoopenclaw:mainfrom
AyedAlmudarra:fix/gateway-crash-resilience

Conversation

@AyedAlmudarra
Copy link

@AyedAlmudarra AyedAlmudarra commented Jan 30, 2026

Summary

Fixes critical gateway stability issues caused by mDNS/Bonjour errors during network interface changes.

Issues Fixed

Changes

1. Error Handling (src/infra/bonjour-ciao.ts)

  • Added error patterns for transient mDNS issues
  • Catches AssertionError from MDNServer

2. Network Error Detection (src/infra/unhandled-rejections.ts)

  • Added ERR_ASSERTION to transient network codes
  • Better detection of mDNS-related assertion errors

3. CLI Flag (src/cli/gateway-cli/run.ts)

  • Added --no-bonjour flag to disable mDNS advertising

4. Tests & Docs

  • Added unit and integration tests (42 total, all passing)
  • Updated CHANGELOG.md

Checklist

  • Code follows project style
  • Tests added and passing
  • CHANGELOG.md updated
  • No breaking changes

Greptile Overview

Greptile Summary

This PR improves gateway stability around unhandled promise rejections by (1) broadening transient network error detection to include specific mDNS/Bonjour failures and (2) adding a new --no-bonjour CLI flag to disable mDNS advertising. It also adds focused tests (unit + an integration-style suite) to cover the new mDNS error patterns and updates the changelog.

The changes fit into the existing crash-resilience layer by extending src/infra/unhandled-rejections.ts (which decides whether an unhandled rejection should exit the process) and by adding a dedicated bonjour/ciao rejection handler in src/infra/bonjour-ciao.ts that can be registered via the unhandled rejection handler registry.

Confidence Score: 2/5

  • This PR is not yet safe to merge due to a likely CLI flag default bug that can disable Bonjour unintentionally.
  • Most changes are localized and well-tested, but the Commander --no-bonjour option appears to be declared with a default that makes it always evaluate to false, which would set OPENCLAW_DISABLE_BONJOUR=1 even when the user did not pass the flag. Additionally, treating all ERR_ASSERTION errors as transient may hide real assertion failures outside the mDNS context.
  • src/cli/gateway-cli/run.ts, src/infra/unhandled-rejections.ts

(2/5) Greptile learns from your feedback when you react with thumbs up/down!

This commit addresses several gateway crash issues:

1. Expand Bonjour error handling (bonjour-ciao.ts):
   - Add patterns for transient mDNS errors:
     - 'REACHED ILLEGAL STATE' (IPv4 address changes)
     - 'IPV4/IPV6 ADDRESS CHANGED'
     - 'NETWORK INTERFACE' changes
   - Handle AssertionError from MDNServer during network churn
   - Fixes openclaw#3821 - mDNS crash loops on sleep/wake

2. Improve network error detection (unhandled-rejections.ts):
   - Add ERR_ASSERTION to transient network codes
   - Detect AssertionError + 'IPV4 ADDRESS CHANGED' pattern
   - Better handling of @homebridge/ciao errors
   - Fixes openclaw#3815 - unhandled fetch failures crashing gateway

3. Add --no-bonjour CLI flag (gateway-cli/run.ts):
   - New option to disable mDNS/Bonjour advertising
   - Sets OPENCLAW_DISABLE_BONJOUR=1 env var
   - Provides workaround for openclaw#3821 while root cause is fixed

Closes openclaw#3821, openclaw#3815
Add tests for new mDNS/Bonjour error handling:
- IPv4 address change errors (AssertionError)
- MDNSServer illegal state errors

Verifies fixes for openclaw#3821 - gateway no longer crashes on mDNS errors
- Add bonjour-errors.integration.test.ts with 8 tests
- Update CHANGELOG.md with unreleased section documenting fixes
- Fix MDNS → MDNSSERVER pattern in bonjour-ciao.ts

Completes recommendations for openclaw#3821, openclaw#3815, openclaw#4501
@openclaw-barnacle openclaw-barnacle bot added the cli CLI command changes label Jan 30, 2026
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 files reviewed, 4 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +5 to +8
## Unreleased

### Fixes
- Gateway: fix crash resilience for mDNS/Bonjour errors during network interface changes (sleep/wake, WiFi reconnect). (#3821)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] Changelog top section uses Unreleased, but repo guidelines say to keep the latest released version at the top and avoid an Unreleased section. This will likely get adjusted during release workflow, so it may be better to fold these entries into the next versioned section instead.

(Guide: CLAUDE.md "Changelog workflow: keep latest released version at top (no Unreleased)".)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: CHANGELOG.md
Line: 5:8

Comment:
[P2] Changelog top section uses `Unreleased`, but repo guidelines say to keep the latest released version at the top and avoid an `Unreleased` section. This will likely get adjusted during release workflow, so it may be better to fold these entries into the next versioned section instead.

(Guide: `CLAUDE.md` "Changelog workflow: keep latest released version at top (no `Unreleased`)".)

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +359 to 362
.option("--no-bonjour", "Disable mDNS/Bonjour service advertising", false)
.action(async (opts) => {
await runGatewayCommand(opts);
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P1] --noonjour option is declared with a default false, which makes the negated flag a no-op in Commander (the default is already false). With this setup opts.bonjour will typically be false even when the user did not pass --no-bonjour, so runGatewayCommand will always set OPENCLAW_DISABLE_BONJOUR=1.

This should be defined without an explicit default (or with a true default via a positive --bonjour option) so that opts.bonjour === false only occurs when the user explicitly disables it.

Also appears in logic usage at src/cli/gateway-cli/run.ts:93-96.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cli/gateway-cli/run.ts
Line: 359:362

Comment:
[P1] `--noonjour` option is declared with a default `false`, which makes the negated flag a no-op in Commander (the default is already false). With this setup `opts.bonjour` will typically be `false` even when the user did not pass `--no-bonjour`, so `runGatewayCommand` will always set `OPENCLAW_DISABLE_BONJOUR=1`.

This should be defined without an explicit default (or with a true default via a positive `--bonjour` option) so that `opts.bonjour === false` only occurs when the user explicitly disables it.

Also appears in logic usage at `src/cli/gateway-cli/run.ts:93-96`.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +15 to +23
beforeAll(() => {
originalExit = process.exit.bind(process);
installUnhandledRejectionHandler();
registerUnhandledRejectionHandler(ignoreCiaoCancellationRejection);
});

afterAll(() => {
process.exit = originalExit;
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] This test suite mutates process.exit in afterAll, but never stubs it during the tests. Since installUnhandledRejectionHandler calls process.exit(1) for fatal/config/unhandled errors, any unexpected rejection during this integration suite could terminate the test runner.

It’s safer to vi.spyOn(process, "exit") (like unhandled-rejections.fatal-detection.test.ts) and restore it in afterAll rather than only saving/restoring the real function.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/infra/bonjour-errors.integration.test.ts
Line: 15:23

Comment:
[P2] This test suite mutates `process.exit` in `afterAll`, but never stubs it during the tests. Since `installUnhandledRejectionHandler` calls `process.exit(1)` for fatal/config/unhandled errors, any unexpected rejection during this integration suite could terminate the test runner.

It’s safer to `vi.spyOn(process, "exit")` (like `unhandled-rejections.fatal-detection.test.ts`) and restore it in `afterAll` rather than only saving/restoring the real function.

How can I resolve this? If you propose a fix, please make it concise.

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 3, 2026

Additional Comments (1)

src/infra/unhandled-rejections.ts
[P2] In isTransientNetworkError, the early return for code === "ERR_ASSERTION" will mark all assertion errors as transient (due to TRANSIENT_NETWORK_CODES including ERR_ASSERTION), even when they’re unrelated to mDNS. If an actual bug surfaces as an assertion failure elsewhere, this will now be downgraded to a warning and the process will continue.

If the intent is to suppress only @homebridge/ciao/MDNSServer assertion failures, consider narrowing the ERR_ASSERTION allowance to message/name patterns (similar to the checks below) instead of treating all assertion errors as transient.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/infra/unhandled-rejections.ts
Line: 83:89

Comment:
[P2] In `isTransientNetworkError`, the early return for `code === "ERR_ASSERTION"` will mark *all* assertion errors as transient (due to `TRANSIENT_NETWORK_CODES` including `ERR_ASSERTION`), even when they’re unrelated to mDNS. If an actual bug surfaces as an assertion failure elsewhere, this will now be downgraded to a warning and the process will continue.

If the intent is to suppress only @homebridge/ciao/MDNSServer assertion failures, consider narrowing the `ERR_ASSERTION` allowance to message/name patterns (similar to the checks below) instead of treating all assertion errors as transient.

How can I resolve this? If you propose a fix, please make it concise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cli CLI command changes

Projects

None yet

1 participant