fix(gateway): improve crash resilience for mDNS and network errors#4653
fix(gateway): improve crash resilience for mDNS and network errors#4653AyedAlmudarra wants to merge 3 commits intoopenclaw:mainfrom
Conversation
This commit addresses several gateway crash issues:
1. Expand Bonjour error handling (bonjour-ciao.ts):
- Add patterns for transient mDNS errors:
- 'REACHED ILLEGAL STATE' (IPv4 address changes)
- 'IPV4/IPV6 ADDRESS CHANGED'
- 'NETWORK INTERFACE' changes
- Handle AssertionError from MDNServer during network churn
- Fixes openclaw#3821 - mDNS crash loops on sleep/wake
2. Improve network error detection (unhandled-rejections.ts):
- Add ERR_ASSERTION to transient network codes
- Detect AssertionError + 'IPV4 ADDRESS CHANGED' pattern
- Better handling of @homebridge/ciao errors
- Fixes openclaw#3815 - unhandled fetch failures crashing gateway
3. Add --no-bonjour CLI flag (gateway-cli/run.ts):
- New option to disable mDNS/Bonjour advertising
- Sets OPENCLAW_DISABLE_BONJOUR=1 env var
- Provides workaround for openclaw#3821 while root cause is fixed
Closes openclaw#3821, openclaw#3815
Add tests for new mDNS/Bonjour error handling: - IPv4 address change errors (AssertionError) - MDNSServer illegal state errors Verifies fixes for openclaw#3821 - gateway no longer crashes on mDNS errors
- Add bonjour-errors.integration.test.ts with 8 tests - Update CHANGELOG.md with unreleased section documenting fixes - Fix MDNS → MDNSSERVER pattern in bonjour-ciao.ts Completes recommendations for openclaw#3821, openclaw#3815, openclaw#4501
| ## Unreleased | ||
|
|
||
| ### Fixes | ||
| - Gateway: fix crash resilience for mDNS/Bonjour errors during network interface changes (sleep/wake, WiFi reconnect). (#3821) |
There was a problem hiding this comment.
[P2] Changelog top section uses Unreleased, but repo guidelines say to keep the latest released version at the top and avoid an Unreleased section. This will likely get adjusted during release workflow, so it may be better to fold these entries into the next versioned section instead.
(Guide: CLAUDE.md "Changelog workflow: keep latest released version at top (no Unreleased)".)
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: CHANGELOG.md
Line: 5:8
Comment:
[P2] Changelog top section uses `Unreleased`, but repo guidelines say to keep the latest released version at the top and avoid an `Unreleased` section. This will likely get adjusted during release workflow, so it may be better to fold these entries into the next versioned section instead.
(Guide: `CLAUDE.md` "Changelog workflow: keep latest released version at top (no `Unreleased`)".)
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| .option("--no-bonjour", "Disable mDNS/Bonjour service advertising", false) | ||
| .action(async (opts) => { | ||
| await runGatewayCommand(opts); | ||
| }); |
There was a problem hiding this comment.
[P1] --noonjour option is declared with a default false, which makes the negated flag a no-op in Commander (the default is already false). With this setup opts.bonjour will typically be false even when the user did not pass --no-bonjour, so runGatewayCommand will always set OPENCLAW_DISABLE_BONJOUR=1.
This should be defined without an explicit default (or with a true default via a positive --bonjour option) so that opts.bonjour === false only occurs when the user explicitly disables it.
Also appears in logic usage at src/cli/gateway-cli/run.ts:93-96.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cli/gateway-cli/run.ts
Line: 359:362
Comment:
[P1] `--noonjour` option is declared with a default `false`, which makes the negated flag a no-op in Commander (the default is already false). With this setup `opts.bonjour` will typically be `false` even when the user did not pass `--no-bonjour`, so `runGatewayCommand` will always set `OPENCLAW_DISABLE_BONJOUR=1`.
This should be defined without an explicit default (or with a true default via a positive `--bonjour` option) so that `opts.bonjour === false` only occurs when the user explicitly disables it.
Also appears in logic usage at `src/cli/gateway-cli/run.ts:93-96`.
How can I resolve this? If you propose a fix, please make it concise.| beforeAll(() => { | ||
| originalExit = process.exit.bind(process); | ||
| installUnhandledRejectionHandler(); | ||
| registerUnhandledRejectionHandler(ignoreCiaoCancellationRejection); | ||
| }); | ||
|
|
||
| afterAll(() => { | ||
| process.exit = originalExit; | ||
| }); |
There was a problem hiding this comment.
[P2] This test suite mutates process.exit in afterAll, but never stubs it during the tests. Since installUnhandledRejectionHandler calls process.exit(1) for fatal/config/unhandled errors, any unexpected rejection during this integration suite could terminate the test runner.
It’s safer to vi.spyOn(process, "exit") (like unhandled-rejections.fatal-detection.test.ts) and restore it in afterAll rather than only saving/restoring the real function.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/infra/bonjour-errors.integration.test.ts
Line: 15:23
Comment:
[P2] This test suite mutates `process.exit` in `afterAll`, but never stubs it during the tests. Since `installUnhandledRejectionHandler` calls `process.exit(1)` for fatal/config/unhandled errors, any unexpected rejection during this integration suite could terminate the test runner.
It’s safer to `vi.spyOn(process, "exit")` (like `unhandled-rejections.fatal-detection.test.ts`) and restore it in `afterAll` rather than only saving/restoring the real function.
How can I resolve this? If you propose a fix, please make it concise.
Additional Comments (1)
If the intent is to suppress only @homebridge/ciao/MDNSServer assertion failures, consider narrowing the Prompt To Fix With AIThis is a comment left during a code review.
Path: src/infra/unhandled-rejections.ts
Line: 83:89
Comment:
[P2] In `isTransientNetworkError`, the early return for `code === "ERR_ASSERTION"` will mark *all* assertion errors as transient (due to `TRANSIENT_NETWORK_CODES` including `ERR_ASSERTION`), even when they’re unrelated to mDNS. If an actual bug surfaces as an assertion failure elsewhere, this will now be downgraded to a warning and the process will continue.
If the intent is to suppress only @homebridge/ciao/MDNSServer assertion failures, consider narrowing the `ERR_ASSERTION` allowance to message/name patterns (similar to the checks below) instead of treating all assertion errors as transient.
How can I resolve this? If you propose a fix, please make it concise. |
bfc1ccb to
f92900f
Compare
Summary
Fixes critical gateway stability issues caused by mDNS/Bonjour errors during network interface changes.
Issues Fixed
Changes
1. Error Handling (
src/infra/bonjour-ciao.ts)2. Network Error Detection (
src/infra/unhandled-rejections.ts)3. CLI Flag (
src/cli/gateway-cli/run.ts)--no-bonjourflag to disable mDNS advertising4. Tests & Docs
Checklist
Greptile Overview
Greptile Summary
This PR improves gateway stability around unhandled promise rejections by (1) broadening transient network error detection to include specific mDNS/Bonjour failures and (2) adding a new
--no-bonjourCLI flag to disable mDNS advertising. It also adds focused tests (unit + an integration-style suite) to cover the new mDNS error patterns and updates the changelog.The changes fit into the existing crash-resilience layer by extending
src/infra/unhandled-rejections.ts(which decides whether an unhandled rejection should exit the process) and by adding a dedicated bonjour/ciao rejection handler insrc/infra/bonjour-ciao.tsthat can be registered via the unhandled rejection handler registry.Confidence Score: 2/5
--no-bonjouroption appears to be declared with a default that makes it always evaluate to false, which would setOPENCLAW_DISABLE_BONJOUR=1even when the user did not pass the flag. Additionally, treating allERR_ASSERTIONerrors as transient may hide real assertion failures outside the mDNS context.(2/5) Greptile learns from your feedback when you react with thumbs up/down!