fix: honest AAAK stats — word-based token estimator, lossy labels#147
Merged
- Replace len(text)//3 token heuristic with word-based estimate (~1.3 tokens/word)
- Old heuristic inflated compression ratios by ~3-5x
- Update docstrings: "compression" → "lossy summarization"
- Update module docstring to clarify AAAK is NOT lossless
- compression_stats() now returns honest field names and a note
- CLI output labels ratios as lossy

Fixes #43
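The two heuristics named in the list above can be compared side by side. This is a minimal sketch, not the project's code: the function names count_tokens_old and count_tokens_word_based are hypothetical, and the word-based formula is the one quoted in later commit messages on this PR.

```python
def count_tokens_old(text: str) -> int:
    """Character-based heuristic the PR replaces: one token per 3 characters."""
    return len(text) // 3


def count_tokens_word_based(text: str) -> int:
    """Word-based heuristic: roughly 1.3 tokens per whitespace-separated word."""
    return max(1, int(len(text.split()) * 1.3))


# Illustrative input only; any text works.
sample = "the quick brown fox jumps over the lazy dog"
print("chars // 3 :", count_tokens_old(sample))
print("words * 1.3:", count_tokens_word_based(sample))
```

Both values are estimates. Neither function runs a real tokenizer, so the results should be read as rough sizes rather than exact token counts.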
mvalentsev
added a commit
to mvalentsev/mempalace
that referenced
this pull request
Apr 7, 2026
PR MemPalace#147 renamed compression_stats fields (ratio -> size_ratio, compressed_chars -> summary_chars) and switched count_tokens to a word-based heuristic, but the test_dialect tests from PR MemPalace#131 still assert the old API and fail on main. Bring TestCompressionStats.test_stats in line with the current dict keys (size_ratio, summary_chars, summary_tokens_est) and update test_count_tokens to match the word-based formula, with extra coverage for the empty and single-word edge cases around max(1, ...). This unblocks CI on main, which currently fails on these two tests.
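The edge cases mentioned above could be covered roughly as follows. This is a sketch under assumptions: count_tokens is redefined locally from the formula quoted elsewhere in this thread, and the test names are hypothetical rather than the ones in test_dialect.

```python
def count_tokens(text: str) -> int:
    """Local stand-in for the word-based estimator (formula quoted in this thread)."""
    return max(1, int(len(text.split()) * 1.3))


def test_count_tokens_empty():
    # "".split() is an empty list, so max(1, ...) supplies the floor of 1.
    assert count_tokens("") == 1


def test_count_tokens_single_word():
    # int(1 * 1.3) truncates to 1.
    assert count_tokens("hello") == 1


def test_count_tokens_two_words():
    # int(2 * 1.3) truncates to 2.
    assert count_tokens("hello world") == 2
```

Run under pytest, these functions check the floor applied by max(1, ...) and the truncation applied by int().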
igorls
added a commit
to igorls/mempalace
that referenced
this pull request
Apr 7, 2026
- stats["ratio"] → stats["size_ratio"]
- stats["compressed_chars"] → stats["summary_chars"]
- count_tokens now uses word-based estimation (2 not len//3)
- Remove unused tmp_path_factory param from _isolate_home fixture
igorls
added a commit
to igorls/mempalace
that referenced
this pull request
Apr 7, 2026
…nused fixture param
mvalentsev
added a commit
to mvalentsev/mempalace
that referenced
this pull request
Apr 7, 2026
The honest-stats rename in PR MemPalace#147 changed the keys returned by Dialect.compression_stats() (ratio -> size_ratio, compressed_chars -> summary_chars, original_tokens / compressed_tokens -> original_tokens_est / summary_tokens_est). cmd_compress still reads the old names, so mempalace compress throws KeyError on the first drawer it touches and the feature is effectively dead.

Also fix the summary line at the bottom of cmd_compress. It called count_tokens("x" * total_original), but count_tokens is word-based (max(1, int(len(text.split()) * 1.3))), and a string of repeated xs is a single "word", so both totals were always 1. Accumulate the per-drawer estimates during the main loop instead, and use a token-based ratio so the summary line is self-consistent with the per-drawer dry-run output.

The storage metadata key names on the compressed collection (compression_ratio, original_tokens) stay the same for compatibility with anything already reading them. Only the source of the values is updated.

Fixes MemPalace#159 (points 1 and 2)
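The summary-line bug described in this commit message can be reproduced in isolation. The following is a sketch, not the actual cmd_compress code: the drawers list and variable names are hypothetical, and the ratio direction (original over summary) is an assumption.

```python
def count_tokens(text: str) -> int:
    """Word-based estimate, using the formula quoted in the commit message."""
    return max(1, int(len(text.split()) * 1.3))


# Hypothetical (original, summary) pairs standing in for drawers.
drawers = [
    ("alpha beta gamma delta epsilon zeta", "alpha beta"),
    ("one two three four five", "one two"),
]

# Buggy approach: a run of "x" contains no whitespace, so split() yields one word.
total_original_chars = sum(len(original) for original, _ in drawers)
buggy_total = count_tokens("x" * total_original_chars)

# Fixed approach: accumulate the per-drawer estimates inside the main loop.
total_original_tokens = 0
total_summary_tokens = 0
for original, summary in drawers:
    total_original_tokens += count_tokens(original)
    total_summary_tokens += count_tokens(summary)

# Assumption: ratio expressed as original / summary, guarded against zero.
token_ratio = total_original_tokens / max(total_summary_tokens, 1)

print("buggy total:", buggy_total)
print("accumulated:", total_original_tokens, "->", total_summary_tokens)
print("token ratio:", round(token_ratio, 2))
```

Summing per-drawer estimates keeps the final line on the same scale as the per-drawer output, because both come from the same estimator applied to real text.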
milla-jovovich
pushed a commit
that referenced
this pull request
Apr 9, 2026
PyPI release cut covering 39 merged PRs since v3.0.0 on 2026-04-06. Highlights: Claude/Codex plugin packaging (#270), security hardening (#387), honest AAAK stats + benchmark corrections (#147), Windows compatibility fixes, Knowledge Graph WAL mode + batching, 10K limit safety caps, and much more. See GitHub release notes for full changelog.
mvalentsev
added a commit
to mvalentsev/mempalace
that referenced
this pull request
Apr 10, 2026
The new test_cmd_compress_dry_run and test_cmd_compress_stores_results tests (added upstream after this branch diverged) mock compression_stats() with the old key names. Update the mocks to use the post-MemPalace#147 keys (original_tokens_est, summary_tokens_est, size_ratio, summary_chars) so they match what the fixed cmd_compress actually reads.
Summary
The AAAK dialect's count_tokens() was using len(text) // 3, which inflated compression ratios by 3-5x. The community caught this in #43.

Changes:
- compress() docstring: "~30x smaller" → "AAAK-formatted summary string"
- compression_stats(): honest field names (summary_tokens_est), adds note about lossy nature

Fixes #43

Test plan
- pytest tests/ -v: 27/27 pass
- ruff check . && ruff format --check .: clean
- Dialect().compress(text) produces honest stats