Skip to content

fix: rotate WAL file at 10 MB to prevent unbounded growth#573

Closed
arnoldwender wants to merge 1 commit intoMemPalace:developfrom
arnoldwender:fix/wal-log-rotation
Closed

fix: rotate WAL file at 10 MB to prevent unbounded growth#573
arnoldwender wants to merge 1 commit intoMemPalace:developfrom
arnoldwender:fix/wal-log-rotation

Conversation

@arnoldwender
Copy link
Copy Markdown
Contributor

Summary

  • Added WAL file rotation to _wal_log() — rotates write_log.jsonl when it exceeds 10 MB
  • Keeps one backup (write_log.jsonl.1), discards older backups on next rotation

Problem

The write-ahead log (~/.mempalace/wal/write_log.jsonl) appends every write operation forever with no rotation or size limit. Over months of heavy use with frequent add_drawer, kg_add, and diary_write calls, this file can grow to gigabytes, consuming disk space silently.

Fix

  • New _wal_rotate() function checks file size before each write
  • Rotates at 10 MB threshold (configurable via _WAL_MAX_BYTES)
  • Keeps exactly one backup (.jsonl.1) for recent audit trail
  • Rotation errors are logged but never block write operations
  • Backup file inherits 0o600 permissions

Test plan

  • WAL file under 10 MB — no rotation, writes normally
  • WAL file at 10 MB — rotated to .jsonl.1, new file created
  • Existing .jsonl.1 backup — overwritten on next rotation
  • Permission errors on rotation — logged, write continues

Copy link
Copy Markdown

@web3guru888 web3guru888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixes a genuine production pain point — WAL files growing to gigabytes silently is something we've run into with high-frequency add_drawer patterns.

The implementation is clean. A few observations:

Overhead: _wal_rotate() calls _WAL_FILE.stat() on every single write. For high-throughput workloads this is a syscall per write. Not catastrophic, but could be batched (e.g., only check every N writes or every 60 seconds). At 10 MB / ~500 bytes per entry that's ~20,000 writes between rotations, so the amortized cost is low. Fine as-is.

Single backup race: If two concurrent writers both hit _wal_rotate() just as the file crosses 10 MB, both will try to rename and one will get FileNotFoundError. This is caught by the outer except Exception and logged, so it won't blow up — but the second caller will then write to a fresh file and the backup from the first rotation will be intact. Acceptable behavior.

0o600 permissions: chmod after rename is a TOCTOU window — the file is temporarily accessible at the backup path without restricted permissions. Ideally you'd rename to a temp path, chmod, then rename to .1. Minor, but worth noting if the WAL contains sensitive memory contents (it does).

The f-string → %s logging fix is a nice bonus (avoids evaluating the format string when the log level is filtered).

Overall: the core fix is correct and the edge cases are non-fatal. LGTM.

jphein added a commit to jphein/mempalace that referenced this pull request Apr 10, 2026
…compress keys

- Default new collections to cosine distance (hnsw:space=cosine) instead
  of L2 — fixes negative similarity scores (MemPalace#568)
- Filter unexpected MCP tool args before dispatch — prevents TypeError
  crash from extra params like top_k (MemPalace#572)
- Rotate WAL at 10 MB with one backup to prevent unbounded growth (MemPalace#573)
- Fix cmd_compress KeyError: align dict keys with compression_stats()
  return values (MemPalace#569)
- Fix spellcheck _load_known_names: read from "people" key, use dict
  keys as canonical names (MemPalace#570)
- Repair command uses cosine distance for rebuilt collection

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@jphein
Copy link
Copy Markdown
Collaborator

jphein commented Apr 10, 2026

Incorporated into #562 — WAL rotation at 10 MB with one .jsonl.1 backup. Errors during rotation are logged but never block writes.

Copy link
Copy Markdown

@web3guru888 web3guru888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WAL rotation is a sensible maintenance feature — unbounded growth is a real operational problem for anyone running intensive agent workloads. The 10 MB threshold is a reasonable default.

A few design notes:

Backup count: A single .jsonl.1 backup means you lose visibility into entries older than one rotation cycle. For most use cases that's fine, but if someone is using the WAL for audit purposes (e.g. replay after palace corruption), losing the .1 file on the second rotation is surprising. Worth documenting the retention policy explicitly — even a comment in _wal_rotate() saying "only the most recent backup is kept; use wal_replay before WAL grows >10 MB if you need full audit history" would set expectations correctly.

Rotation under concurrent writers: If two processes call _wal_log() near-simultaneously and both observe the size threshold, both may try to rename write_log.jsonl to write_log.jsonl.1. On POSIX, os.rename is atomic so one will win and the other will rotate an already-rotated file — resulting in some entries in .1 getting overwritten. For single-process use this is fine; for multi-process MCP server setups, worth a file lock around the rotate-then-reopen sequence. A simple fcntl.flock on the WAL file would cover it.

Missing tests: The checklist shows all four test scenarios as unchecked. These aren't hard to write — a tmp_path fixture with a pre-seeded WAL file over the threshold, assert that .1 is created and the original is reset. Would be good to have them before merge.

Configurable threshold: _WAL_MAX_BYTES as a module constant is the right approach — clean and easy to override in tests. Alternatively, expose it via mempalace.yaml config (wal_max_bytes: 10_000_000) so operators running high-throughput pipelines can tune it without patching.

Directionally right. The core logic is straightforward — the concurrent writer edge case is the main thing worth hardening before this goes to production.


[MemPalace-AGI integration — 215 tests, 710 KG entities]

@arnoldwender
Copy link
Copy Markdown
Contributor Author

Thanks for the detailed review — all valid points.

Overhead: Agreed, stat() per write amortizes well at ~20K writes between rotations. Leaving as-is.

Concurrent writers: Fair point on the POSIX rename race. For single-process MCP this is non-fatal as you noted. A follow-up could add fcntl.flock for multi-process safety if that becomes a supported pattern.

TOCTOU on chmod: The 0o700 parent directory limits exposure, but noted for a future hardening pass.

Tests: Will add rotation coverage — tmp_path with pre-seeded WAL over threshold, assert .1 created and original reset.

Backup count / retention: Good call on documenting this. Will add an inline comment clarifying single-backup semantics and the recommendation to replay before rotation if full audit history is needed.

Configurable threshold: Exposing via mempalace.yaml is a clean extension — happy to add that in a follow-up if there's interest.

@bensig bensig changed the base branch from main to develop April 11, 2026 22:21
@arnoldwender
Copy link
Copy Markdown
Contributor Author

Incorporated into #562 by @jphein — closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants