fix(memory): retry transient FileProvider-backed reads#85351
Conversation
|
Codex review: needs maintainer review before merge. Reviewed June 1, 2026, 3:16 PM ET / 19:16 UTC. Summary PR surface: Source +96, Tests +230, Generated 0. Total +326 across 14 files. Reproducibility: yes. source-level reproduction is high confidence: current main has non-retried memory/session/wiki reads, and the PR proof injects the reported transient errno failures through real runtime modules. I did not run a live iCloud/FileProvider repro in this read-only review. Review metrics: 1 noteworthy metric.
Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Rank-up moves:
Risk before merge
Maintainer options:
Next step before merge
Security Review detailsBest possible solution: Land this shape after maintainers accept the public SDK helper contract, or revise the branch so the shared retry policy stays internal while preserving the same memory-core/session/wiki coverage. Do we have a high-confidence way to reproduce the issue? Yes, source-level reproduction is high confidence: current main has non-retried memory/session/wiki reads, and the PR proof injects the reported transient errno failures through real runtime modules. I did not run a live iCloud/FileProvider repro in this read-only review. Is this the best way to solve the issue? Yes for the runtime bug: one shared bounded retry helper at the memory-read seam is cleaner than ad hoc loops. The unresolved part is whether exposing that helper through the public Plugin SDK facade is the best API shape. AGENTS.md: found and applied where relevant. Codex review notes: model gpt-5.5, reasoning high; reviewed against 66f797b22c10. Label changesLabel changes:
Label justifications:
Evidence reviewedPR surface: Source +96, Tests +230, Generated 0. Total +326 across 14 files. View PR surface stats
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
|
ClawSweeper PR egg ✨ Hatched: 🥚 common Gilded Test Hopper Hatch commandComment Hatchability rules:
Rarity: 🥚 common. What is this egg doing here?
|
4f89fd6 to
871abcf
Compare
871abcf to
7e3749f
Compare
7e3749f to
98fe321
Compare
98fe321 to
b2bf1cb
Compare
|
@clawsweeper re-review |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
b2bf1cb to
ebd8580
Compare
ebd8580 to
6269cdb
Compare
|
Thanks for the retry coverage here. I trimmed the retry budget while rebasing this branch because these are local filesystem reads, not network/API calls. The current policy is intentionally small: initial read + 2 retries, with 25ms then 50ms backoff. That keeps the fix targeted at short transient file-read contention without letting memory indexing/wiki compile stall for long periods across many files. It also matches the nearby memory atomic-reindex retry shape more closely than the previous longer tail. So the intended contract is: retry brief local FS blips, then fail fast if the read is still unhealthy. |
Summary
Unknown system error -11/ FileProvider-style read failures from memory-host reads, session transcript reads, and wiki page reads.memory-host-sdk, route the affected memory-core and memory-wiki read paths through it, and preserve existing missing-file behavior.Motivation
mainand the latest shipped release still propagate those failures directly, which can abort memory sync or break librarian wiki compile.Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
Real behavior proof (required for external PRs)
errno -11read failures no longer abort workspace memory reads, session transcript indexing, or wiki compile page reads.node --import tsx --input-type=module.{ "attempts": { "memoryOpen": 2, "sessionOpen": 2, "wikiReadFile": 6 }, "memoryResult": { "text": "alpha\nbeta", "path": "memory/retry.md", "from": 1, "lines": 2 }, "sessionLineMapLength": 1, "sessionContent": "User: hello", "compiledSourceCount": 1, "updatedFiles": [ ".openclaw-wiki/cache/agent-digest.json", ".openclaw-wiki/cache/claims.jsonl", "concepts/index.md", "entities/index.md", "index.md", "reports/claim-health.md", "reports/contradictions.md", "reports/index.md", "reports/low-confidence.md", "reports/open-questions.md", "reports/person-agent-directory.md", "reports/privacy-review.md", "reports/provenance-coverage.md", "reports/relationship-graph.md", "reports/stale-pages.md", "sources/alpha.md", "sources/index.md", "syntheses/index.md" ] }-11failures, and the real runtime modules returned the expected memory content, session transcript content, and compiled wiki output instead of surfacing a raw read error.main, injecting the same transientNodeJS.ErrnoExceptioninto these read paths propagates the raw failure through memory sync / compile instead of recovering.Root Cause (if applicable)
errno -11/EAGAIN/EWOULDBLOCK/EDEADLKbehavior across the affected read surfaces.mainand the shipped release still exposed these raw failures from session sync and wiki compile paths, with related FileProvider behavior already noted in readPageSummaries: no concurrency limit on fs.readFile, triggers EDEADLK on iCloud/FileProvider-backed vaults #68738.Regression Test Plan (if applicable)
packages/memory-host-sdk/src/host/read-file.test.ts,packages/memory-host-sdk/src/host/internal.test.ts,extensions/memory-core/src/memory/manager.read-file.test.ts,extensions/memory-core/src/memory/manager-sync-ops.startup-catchup.test.ts,extensions/memory-wiki/src/compile.test.ts-11/EAGAIN-family failures on memory file reads, session transcript reads, session delta newline reads, and wiki page reads recover inside the shared helper without changing missing-file behavior.User-visible / Behavior Changes
Memory sync and wiki compile now retry transient FileProvider-style read failures instead of surfacing raw
Unknown system error -11failures immediately. Missing files still behave exactly as before, and permanent read errors still propagate.Diagram (if applicable)
Security Impact (required)
Yes/No) NoYes/No) NoYes/No) NoYes/No) NoYes/No) NoYes, explain risk + mitigation: N/ARepro + Verification
Environment
Steps
node scripts/run-vitest.mjs packages/memory-host-sdk/src/host/read-file.test.ts packages/memory-host-sdk/src/host/internal.test.ts extensions/memory-core/src/memory/manager.read-file.test.ts extensions/memory-core/src/memory/manager-sync-ops.startup-catchup.test.ts extensions/memory-wiki/src/compile.test.tspnpm check:changedpnpm plugin-sdk:api:gennode --max-old-space-size=8192 --import tsx scripts/generate-plugin-sdk-api-baseline.ts --checkExpected
Actual
pnpm check:changedpassed.node --max-old-space-size=8192 --import tsx scripts/generate-plugin-sdk-api-baseline.ts --checkpassed after regenerating the baseline hash.Evidence
Attach at least one:
Human Verification (required)
What you personally verified (not just CI), and how:
readMemoryFile,buildSessionEntry, andcompileMemoryWikiVault; focused regression suite; changed checks; Plugin SDK API baseline check.Review Conversations
If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.
Compatibility / Migration
Yes/No) YesYes/No) NoYes/No) NoRisks and Mitigations
-11/EAGAIN-family signatures with a short bounded backoff and still rethrows permanent errors.src/plugin-sdk/memory-core-host-engine-storage.tsintentionally and regeneratesdocs/.generated/plugin-sdk-api-baseline.sha256.