Skip to content

fix(memory): use WAL journal mode for SQLite memory index#36535

Open
anthonyfig wants to merge 1 commit intoopenclaw:mainfrom
anthonyfig:fix/memory-wal-journal
Open

fix(memory): use WAL journal mode for SQLite memory index#36535
anthonyfig wants to merge 1 commit intoopenclaw:mainfrom
anthonyfig:fix/memory-wal-journal

Conversation

@anthonyfig
Copy link
Copy Markdown

Set PRAGMA journal_mode=WAL and PRAGMA busy_timeout=5000 when opening the memory index database. WAL mode survives SIGTERM/SIGKILL mid-write and prevents database corruption that occurs with the default rollback journal when the gateway is killed during write operations.

busy_timeout prevents immediate SQLITE_BUSY errors under contention.

Closes #36035

Summary

  • Problem: Memory index SQLite database uses journal_mode=delete (SQLite default). If the gateway is killed mid-write (SIGTERM during restart/update/config change), the journal file is deleted before the write completes, corrupting the database.
  • Why it matters: Users hit "database disk image is malformed" and memory search stops working entirely. The only fix is manual deletion and full reindex (rm + openclaw memory index). Gateway restarts are frequent.
  • What changed: Added PRAGMA journal_mode = WAL and PRAGMA busy_timeout = 5000 in openDatabaseAtPath() — the single entry point where all memory SQLite databases are opened (main index + temp reindex). Added 3 tests.
  • What did NOT change (scope boundary): No schema changes, no config surface changes, no migration. QMD manager (read-only, separate code path) is not touched.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

None visible. Existing databases are transparently converted to WAL on first open (SQLite handles this automatically). Two sidecar files (-wal, -shm) will appear next to the database file — moveIndexFiles() and removeIndexFiles() already handle these suffixes.

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: Linux (Ubuntu 24.04)
  • Runtime/container: Node.js 22.22.0
  • Model/provider: Mock embeddings (unit tests)

Steps

  1. pnpm install
  2. vitest run src/memory/manager.wal-journal.test.ts (focused WAL tests)
  3. vitest run src/memory/ (full memory suite)
  4. pnpm test (full project suite)

Expected

  • All tests pass, PRAGMA journal_mode returns wal on opened databases.

Actual

  • 3/3 new WAL tests pass. 35/35 memory test files pass (237 tests). 817/817 full suite passes (6664 tests).

Evidence

  • Failing test/log before + passing after

New test src/memory/manager.wal-journal.test.ts asserts:

  1. PRAGMA journal_mode returns wal after opening
  2. PRAGMA busy_timeout returns 5000
  3. WAL mode persists after a full safe reindex (temp DB swap)

Human Verification (required)

  • Verified scenarios: WAL active on fresh DB, WAL persists after forced reindex, busy_timeout correctly set, all 35 memory test files pass, full project suite passes.
  • Edge cases checked: Safe reindex (temp DB also gets WAL via same code path), unsafe reindex (test-only path), readonly recovery reopens through openDatabase() which calls openDatabaseAtPath().
  • What you did not verify: Live migration of a pre-existing production database from DELETE to WAL mode (SQLite documents this as automatic and transparent, but not tested against a real user index).

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No — SQLite converts automatically on first PRAGMA journal_mode = WAL.

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: Remove the two db.exec("PRAGMA ...") lines in manager-sync-ops.ts:openDatabaseAtPath().
  • Files/config to restore: src/memory/manager-sync-ops.ts
  • Known bad symptoms reviewers should watch for: SQLITE_BUSY errors (unlikely with 5s busy_timeout), unexpected -wal/-shm files appearing alongside main.sqlite (expected behavior).

Risks and Mitigations

  • Risk: WAL sidecar files (-wal, -shm) must travel with the database during atomic reindex swaps.
    • Mitigation: moveIndexFiles() and removeIndexFiles() already handle ["", "-wal", "-shm"] suffixes — this was designed for WAL from the start.

Set PRAGMA journal_mode=WAL and PRAGMA busy_timeout=5000 when opening the
memory index database. WAL mode survives SIGTERM/SIGKILL mid-write and
prevents database corruption that occurs with the default rollback journal
when the gateway is killed during write operations.

busy_timeout prevents immediate SQLITE_BUSY errors under contention.

Closes openclaw#36035
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c10436bd65

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

return new DatabaseSync(dbPath, { allowExtension: this.settings.store.vector.enabled });
const db = new DatabaseSync(dbPath, { allowExtension: this.settings.store.vector.enabled });
// WAL survives SIGTERM/SIGKILL mid-write and allows concurrent readers.
db.exec("PRAGMA journal_mode = WAL");
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Set busy_timeout before switching journal mode

When an existing index is still in journal_mode=delete and another process holds a write transaction (for example, overlapping gateway processes during restart), PRAGMA journal_mode = WAL can raise SQLITE_BUSY; because busy_timeout is configured only afterward, this failure is immediate and aborts manager initialization. Before this change, opening the DB in that state could still proceed (e.g., schema no-op reads), so this introduces a startup regression that can temporarily disable memory search until the lock clears. Configure PRAGMA busy_timeout first (or retry the WAL switch) so first-open migration to WAL is lock-tolerant.

Useful? React with 👍 / 👎.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 5, 2026

Greptile Summary

Summary

This PR adds PRAGMA journal_mode = WAL and PRAGMA busy_timeout = 5000 to openDatabaseAtPath() — the single entry point for all memory SQLite databases — fixing a real data-corruption bug caused by SIGTERM/SIGKILL arriving mid-write with the default rollback journal. The change is minimal, well-scoped, and already fits the existing file-swap infrastructure (moveIndexFiles / removeIndexFiles already handle -wal / -shm sidecar files).

Correctness: Both PRAGMAs are applied before the DB is used, so the main index and the temp reindex DB both get WAL mode through the same code path.

Backward compatibility: SQLite converts a DELETE-journal database to WAL transparently on the first PRAGMA journal_mode = WAL; no manual migration required.

WAL mode verification: db.exec("PRAGMA journal_mode = WAL") silently discards the returned mode string. On unsupported filesystems (NFS/SMB, read-only mounts), SQLite can silently keep the current mode rather than enabling WAL. Consider verifying the result and logging a warning to surface this failure rather than leaving the database unprotected on edge-case mounts.

Tests: 3 new tests cover fresh-DB WAL mode, busy_timeout value, and WAL persistence through a full safe reindex — adequate coverage for the change.

Confidence Score: 4/5

  • Safe to merge. The change is minimal, backward-compatible, and directly addresses an active data-corruption issue. One consideration: WAL activation can fail silently on unsupported filesystems — consider verifying the result.
  • This PR is safe to merge. The WAL + busy_timeout change is focused, well-tested (3 new tests), and transparent to existing databases. The only gap is that db.exec() silently discards the PRAGMA result, meaning WAL activation could fail without warning on edge-case filesystems (NFS/SMB, read-only mounts). On typical local filesystems this is not a concern, but since the PR's purpose is crash safety, surfacing a warning on failure would strengthen the fix. This is worth the author's consideration but is not blocking.
  • src/memory/manager-sync-ops.ts — consider adding verification that PRAGMA journal_mode = WAL actually succeeded, especially on non-standard filesystems.

Last reviewed commit: c10436b

Comment on lines +263 to +264
db.exec("PRAGMA journal_mode = WAL");
db.exec("PRAGMA busy_timeout = 5000");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WAL mode activation not verified

db.exec() discards the result of PRAGMA journal_mode = WAL. SQLite can silently keep the current mode (returning "delete" instead of "wal") when WAL is unsupported — e.g., on network filesystems (NFS/SMB), read-only mounts, or when another connection has the DB open in a different journaling mode. Since this PR's entire purpose is to guarantee WAL protection against corruption, it would be safer to verify the PRAGMA succeeded and log a warning if it didn't:

const modeRow = db.prepare("PRAGMA journal_mode = WAL").get() as { journal_mode: string };
if (modeRow.journal_mode !== "wal") {
  log.warn(`memory index: could not enable WAL journal mode (got '${modeRow.journal_mode}'); database may be at risk of corruption under SIGTERM`);
}
db.exec("PRAGMA busy_timeout = 5000");

This is a low-probability scenario on typical local filesystems, but since the fix's safety goal could silently fail without warning on edge-case mounts, surfacing a diagnostic message would help operators catch misconfiguration.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/memory/manager-sync-ops.ts
Line: 263-264

Comment:
**WAL mode activation not verified**

`db.exec()` discards the result of `PRAGMA journal_mode = WAL`. SQLite can silently keep the current mode (returning `"delete"` instead of `"wal"`) when WAL is unsupported — e.g., on network filesystems (NFS/SMB), read-only mounts, or when another connection has the DB open in a different journaling mode. Since this PR's entire purpose is to guarantee WAL protection against corruption, it would be safer to verify the PRAGMA succeeded and log a warning if it didn't:

```typescript
const modeRow = db.prepare("PRAGMA journal_mode = WAL").get() as { journal_mode: string };
if (modeRow.journal_mode !== "wal") {
  log.warn(`memory index: could not enable WAL journal mode (got '${modeRow.journal_mode}'); database may be at risk of corruption under SIGTERM`);
}
db.exec("PRAGMA busy_timeout = 5000");
```

This is a low-probability scenario on typical local filesystems, but since the fix's safety goal could silently fail without warning on edge-case mounts, surfacing a diagnostic message would help operators catch misconfiguration.

How can I resolve this? If you propose a fix, please make it concise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Memory SQLite should use WAL journal mode by default

2 participants