Skip to content

fix(config): log observe recovery write failures#75441

Merged
sallyom merged 1 commit intoopenclaw:mainfrom
sallyom:fix/config-recovery-observe-errors
May 1, 2026
Merged

fix(config): log observe recovery write failures#75441
sallyom merged 1 commit intoopenclaw:mainfrom
sallyom:fix/config-recovery-observe-errors

Conversation

@sallyom
Copy link
Copy Markdown
Contributor

@sallyom sallyom commented May 1, 2026

Summary

Fixes #63423

Test plan

  • pnpm test src/config/io.observe-recovery.test.ts src/config/io.audit.test.ts
  • pnpm lint:core
  • pnpm tsgo:core
  • pnpm tsgo:core:test
  • pnpm exec oxfmt --check --threads=1 CHANGELOG.md src/config/io.audit.ts src/config/io.ts src/config/io.observe-recovery.ts src/config/io.observe-recovery.test.ts
  • git diff --check

@openclaw-barnacle openclaw-barnacle Bot added size: M maintainer Maintainer-authored PR labels May 1, 2026
@clawsweeper
Copy link
Copy Markdown
Contributor

clawsweeper Bot commented May 1, 2026

Codex review: needs maintainer review before merge.

What this changes:

The PR logs config health-state write failures in config I/O and observe-recovery helpers, adds async and sync regression tests for those failure paths, and adds a changelog entry.

Maintainer follow-up before merge:

This is an open maintainer-labeled implementation PR with a useful narrow code diff, but the remaining work is maintainer/author handling of PR metadata and exact-head CI rather than a safe autonomous repair lane.

Security review:

Security review cleared: The diff is limited to local config recovery logging, focused tests, and changelog text, with no dependency, workflow, permission, secret-handling, network, or code-execution changes.

Review details

Best possible solution:

Land the narrow config health-state logging and regression coverage after removing the auto-close wording for #63423 and getting exact-head CI green. Keep #63423 open, or split it into narrower follow-ups, for the remaining pairing and infra empty-catch cleanup.

Do we have a high-confidence way to reproduce the issue?

Yes. A high-confidence reproduction path is to inject writeFile or writeFileSync failures for .openclaw/logs/config-health.json; current main catches those failures silently in both config I/O and observe-recovery health-state helpers, while the PR adds tests for those injected failures.

Is this the best way to solve the issue?

Yes for the source patch direction: logging at the health-state helper boundary with async and sync tests is a narrow maintainable fix. The PR metadata should be changed from closing #63423 to relating to it, and the current CI failure must be cleared before merge.

What I checked:

  • Current main still swallows observe-recovery health-state writes: writeConfigHealthState and writeConfigHealthStateSync in the observe-recovery module still end with empty catch {} blocks on current main, so the central behavior requested by this PR is not implemented there. (src/config/io.observe-recovery.ts:339, 7ac23eeeb581)
  • Current main still treats config I/O health-state writes as silent best-effort: The config I/O helper catches health-state write failures and only leaves a best-effort comment, which means ordinary config reads/writes do not surface those failures today. (src/config/io.ts:440, 7ac23eeeb581)
  • Current PR diff is scoped to health-state write logging: The latest PR head adds warning logs around health-state writes in src/config/io.observe-recovery.ts and src/config/io.ts, adds focused tests in src/config/io.observe-recovery.test.ts and src/config/io.write-config.test.ts, and adds one changelog entry. (src/config/io.observe-recovery.ts:344, c49ed32f4567)
  • The related umbrella issue is broader than this PR: The PR body still says Fixes #63423, while the provided related issue body lists remaining empty catches in src/pairing/ and src/infra/; this config-only diff should not auto-close that broader report. (c49ed32f4567)
  • Protected open PR state and exact-head CI: The GitHub API reports the PR is open, not draft, labeled maintainer, mergeable with mergeable_state: unstable, and exact head c49ed32f4567fe004a9c04d95348ebf8d46da557 has a completed checks-fast-protocol failure. (c49ed32f4567)
  • Feature history points to config recovery maintainers: GitHub commit history for src/config/io.observe-recovery.ts includes the recent fix(config): surface backup restore copy failures in audit and logs #70515 restore-copy audit fix, earlier config clobber/invalid-config recovery work, and adjacent validation-detail recovery work in the same module and tests. (src/config/io.observe-recovery.ts)

Likely related people:

  • steipete: File history shows Peter Steinberger authored the invalid-config recovery and critical config-clobber recovery commits that established the observe-recovery module and current recovery behavior. (role: introduced behavior and config recovery maintainer; confidence: high; commits: ffb1628727fd, 7b2c9a6fa3d3, 83801c49f7a3; files: src/config/io.observe-recovery.ts, src/config/io.observe-recovery.test.ts, src/config/io.ts)
  • davidangularme: Commit f8ffc3e merged fix(config): surface backup restore copy failures in audit and logs #70515 and updated the same observe-recovery/audit/test files for backup restore write failure logging and audit behavior that overlap the discussion around this PR. (role: recent merged fix author; confidence: high; commits: f8ffc3ec4f97; files: src/config/io.observe-recovery.ts, src/config/io.observe-recovery.test.ts, src/config/io.audit.ts)
  • amknight: Commit aa9db99 added config recovery validation-detail logging and tests in the same observe-recovery and gateway recovery surface. (role: recent adjacent maintainer; confidence: medium; commits: aa9db998f7bf; files: src/config/io.observe-recovery.ts, src/config/io.observe-recovery.test.ts, src/gateway/config-recovery-notice.ts)
  • jalehman: Commit f369939 touched observe-recovery and recovery policy to avoid plugin-local config rollback, and the PR discussion also routes review attention to Josh Lehman. (role: adjacent recovery policy owner; confidence: medium; commits: f369939fedd5; files: src/config/io.observe-recovery.ts, src/config/recovery-policy.ts, src/gateway/server-startup-config.ts)

Remaining risk / open question:

  • The PR body still uses closing syntax for [Bug]: Empty catch blocks swallow errors silently, hindering debugging #63423 even though this diff only addresses config health-state write failures and does not cover the pairing or infra catch blocks listed in that issue.
  • Exact head c49ed32f4567fe004a9c04d95348ebf8d46da557 currently has a checks-fast-protocol CI failure that needs resolution before merge.
  • The PR base is three commits behind current main, including an active CHANGELOG.md change, so a final rebase or merge sanity check is needed.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 7ac23eeeb581.

@sallyom sallyom force-pushed the fix/config-recovery-observe-errors branch 2 times, most recently from 67d3ef1 to 184f022 Compare May 1, 2026 06:09
@davidangularme
Copy link
Copy Markdown
Contributor

Hi @sallyom and @jalehman,

I see that my core work from #70515 (audit extensions and EACCES tests) has been integrated here. Since those commits were already passing and validated in my original PR, I’m concerned that merging them into this broader scope of health-state logging is complicating the CI resolution.

I’m happy to help debug, but if the CI failures persist, it might be more efficient for the project to land the focused, verified fix from #70515 first. This would ensure the primary bug is resolved while we iterate on the additional logging layers separately.

Let me know the best way to support this.

@davidangularme
Copy link
Copy Markdown
Contributor

I’ve just reviewed the CI logs for the checks-node-core failure here. It seems the integration of health-state logging is causing regressions that weren't present in my original PR #70515. As a maintainer, keeping the main branch stable is priority one. Since #70515 was already green and focused, I strongly suggest we land it now to fix the core bug. We can always follow up with @sallyom's additional logging once her CI is stabilized. Reopening #70515 is the fastest way to get back to a clean state.

@clawsweeper clawsweeper Bot mentioned this pull request May 1, 2026
25 tasks
@davidangularme
Copy link
Copy Markdown
Contributor

davidangularme commented May 1, 2026

@jalehman @steipete,

There is a significant issue with #75441 being labeled as 'superseding' my work. The claim that it "resolves the full issue #63423" is factually incorrect; that issue requires changes in src/pairing/ and src/infra/ which are untouched here.

By bundling my stable, verified fixes into a "full resolution" package that doesn't actually exist, this PR has introduced a structural regression. Specifically, it is currently failing CI because it attempts to import internal functions recently privatized in ad3e4db—functions my original fix does not depend on.

We are now in a situation where a green, verified baseline is being held hostage by a 'Red' PR that misrepresents its own scope. I have provided a standalone, passing branch (fix/config-restore-baseline) to restore stability immediately. I suggest landing the verified commits independently rather than waiting for a broken and over-scoped packaging to be reworked.

@davidangularme
Copy link
Copy Markdown
Contributor

@sallyom A case study in how open-source contributions get lost — told entirely through public, verifiable facts.

The bug:
In openclaw, when a config backup restore fails (disk full, EACCES), the system logs "Config auto-restored from backup" and writes valid: true to the audit record. Users and audit tooling believe the config was repaired. It was not. A corrupted config persists silently.

What I did:
• Identified the root cause: a bare catch {} around copyFile in io.observe-recovery.ts (lines 659-662) that swallowed all errors unconditionally.
• Submitted PR #70515 with the fix, covering both async and sync recovery paths.
• Added two new audit fields: restoreErrorCode and restoreErrorMessage.
• Wrote regression tests with injected EACCES failures for both paths.
• Greptile's automated review: 5/5 confidence, "Safe to merge."
• Addressed every piece of review feedback (sync mirror test + log formatting) in a follow-up commit.
• CI was green. A maintainer (jalehman) self-assigned to the PR.

What happened next:
• Another contributor (sallyom) opened PR #75441, cherry-picking my commits, and labeling it "Includes and supersedes #70515."
• sallyom commented on my PR: "#75441 will resolve the full issue in one PR."
• An automated bot (clawsweeper) closed my PR as "duplicate/superseded."
• No human maintainer reviewed or approved this decision.

The factual problems with these claims:
• "Will resolve the full issue in one PR": The diff of #75441 only touches src/config/. Issue #63423 ("Empty catch blocks swallow errors silently") explicitly requires changes in src/pairing/ and src/infra/. These directories are untouched.
• "Fixes #63423" is used as a GitHub tag in the PR header. This is an auto-close tag — if merged, it would automatically close issue #63423 as resolved, even though the required files are not addressed. The project's bug tracking would be silently corrupted.
• "Supersedes #70515": The PR that claims to supersede mine has 5 failing CI checks and unresolved merge conflicts. The PR it replaced was green.
• The CI failures are caused by importing functions (observeConfigSnapshot, observeConfigSnapshotSync) that were recently moved to internal scope by a maintainer in commit ad3e4db. My original fix had no dependency on these functions.

What I did to try to resolve it:
• Offered to help debug the failing PR.
• Proposed landing the focused, verified fix first, then iterating on the broader scope separately.
• Created a standalone baseline branch (fix/config-restore-baseline, PR #75568) to give maintainers a clean path.
• Opened issue #75651 documenting the structural conflicts with evidence.

The response:
• PR #75568 was closed.
• Issue #75651 was closed as "not planned."
• The clawsweeper bot's own analysis on that issue confirmed every structural point I raised: "PR diff is config-only [...] with no src/pairing or src/infra changes" and "Current main keeps observe snapshot helpers internal."

The current state:
• The original bug is still on main.
• My verified, green fix sits in a closed PR.
• The PR that replaced it cannot merge.
• The bot that closed my work also produced the analysis that confirms the replacement doesn't do what it claims.

For context, this is not my first contribution to openclaw. PR #74453 (fix for voice-call webhook in-flight limiter) was merged.

I'm not posting this to call anyone out. Every commit, CI log, bot analysis, and comment is public. I'm posting it because this pattern — where a working contribution is absorbed into a larger, broken scope, the original is closed by automation, and the contributor loses all agency — is one of the invisible friction points that drives people away from open source.

The facts are the facts.

#OpenSource #SoftwareEngineering #CodeReview #DevCommunity

@sallyom sallyom marked this pull request as draft May 1, 2026 18:04
@sallyom
Copy link
Copy Markdown
Contributor Author

sallyom commented May 1, 2026

holding this as DRAFT PR until #70515 merges, will rebase off of that.

@sallyom sallyom force-pushed the fix/config-recovery-observe-errors branch from 184f022 to c49ed32 Compare May 1, 2026 21:55
@sallyom sallyom marked this pull request as ready for review May 1, 2026 21:58
@sallyom sallyom merged commit 667371d into openclaw:main May 1, 2026
220 of 223 checks passed
@sallyom
Copy link
Copy Markdown
Contributor Author

sallyom commented May 1, 2026

Merged #75441 via squash.\n\nPrepared head SHA: c49ed32\nMerge commit: 667371d

lxe pushed a commit to lxe/openclaw that referenced this pull request May 6, 2026
Merged via squash.

Prepared head SHA: c49ed32
Co-authored-by: sallyom <[email protected]>
Reviewed-by: @sallyom
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

maintainer Maintainer-authored PR size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Empty catch blocks swallow errors silently, hindering debugging

2 participants