
fix: honest AAAK stats — word-based token estimator, lossy labels #147

Merged
bensig merged 1 commit into main from fix/aaak-honest-stats
Apr 7, 2026
Conversation

Collaborator

@bensig bensig commented Apr 7, 2026

Summary

The AAAK dialect's count_tokens() estimated tokens as len(text) // 3, which inflated the reported compression ratios by 3-5x. The community caught this in #43.

Changes:

  • Token estimator now uses word count * 1.3 (conservative average)
  • Module docstring: "compressed" → "lossy summarization", adds NOTE about raw mode benchmark
  • compress() docstring: "~30x smaller" → "AAAK-formatted summary string"
  • compression_stats(): honest field names (summary_tokens_est), adds note about lossy nature
  • CLI output labels ratios as "lossy summary, not lossless compression"
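
To make the estimator change concrete, here is a minimal sketch of the two heuristics side by side. The old formula is the len(text) // 3 expression named in the summary, and the new one follows the word count * 1.3 rule with a floor of 1. The function names old_count_tokens and new_count_tokens and the sample sentence are hypothetical and are not taken from the repository.

```python
def old_count_tokens(text: str) -> int:
    """Character-based heuristic removed by this PR: about 3 characters per token."""
    return len(text) // 3


def new_count_tokens(text: str) -> int:
    """Word-based heuristic: about 1.3 tokens per whitespace-separated word, minimum 1."""
    return max(1, int(len(text.split()) * 1.3))


# Hypothetical sample sentence; this is not real AAAK output.
original = "The team decided to migrate the billing service to the new cluster next week"

for name, estimator in [("old", old_count_tokens), ("new", new_count_tokens)]:
    print(f"{name} estimate: {estimator(original)} tokens")
```

Both functions are rough estimates rather than a real tokenizer, so the numbers are only useful for comparing texts measured the same way.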

Fixes #43

Test plan

  • pytest tests/ -v — 27/27 pass
  • ruff check . && ruff format --check . — clean
  • Manual: Dialect().compress(text) produces honest stats

- Replace len(text)//3 token heuristic with word-based estimate (~1.3 tokens/word)
- Old heuristic inflated compression ratios by ~3-5x
- Update docstrings: "compression" → "lossy summarization"
- Update module docstring to clarify AAAK is NOT lossless
- compression_stats() now returns honest field names and a note
- CLI output labels ratios as lossy

Fixes #43
@bensig bensig merged commit 68e3414 into main Apr 7, 2026
4 checks passed
mvalentsev added a commit to mvalentsev/mempalace that referenced this pull request Apr 7, 2026
PR MemPalace#147 renamed compression_stats fields (ratio -> size_ratio,
compressed_chars -> summary_chars) and switched count_tokens to a
word-based heuristic, but the test_dialect tests from PR MemPalace#131 still
assert the old API and fail on main.

Bring TestCompressionStats.test_stats in line with the current dict
keys (size_ratio, summary_chars, summary_tokens_est) and update
test_count_tokens to match the word-based formula, with extra
coverage for the empty and single-word edge cases around max(1, ...).

This unblocks CI on main, which currently fails on these two tests.
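
The edge cases mentioned above can be sketched as a small test. The count_tokens function here is a local stand-in that assumes the word-based formula max(1, int(len(text.split()) * 1.3)); it is not imported from the project, and the test name is hypothetical.

```python
def count_tokens(text: str) -> int:
    # Local stand-in for the project's estimator (assumed formula, see lead-in).
    return max(1, int(len(text.split()) * 1.3))


def test_count_tokens_edge_cases() -> None:
    # Empty input has zero words, so the max(1, ...) floor applies.
    assert count_tokens("") == 1
    # A single word: int(1 * 1.3) == 1.
    assert count_tokens("hello") == 1
    # Two words: int(2 * 1.3) == 2, whereas len(text) // 3 would have given 3.
    assert count_tokens("hello world") == 2


test_count_tokens_edge_cases()
```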
igorls added a commit to igorls/mempalace that referenced this pull request Apr 7, 2026
- stats["ratio"] → stats["size_ratio"]
- stats["compressed_chars"] → stats["summary_chars"]
- count_tokens now uses word-based estimation (2 not len//3)
- Remove unused tmp_path_factory param from _isolate_home fixture
mvalentsev added a commit to mvalentsev/mempalace that referenced this pull request Apr 7, 2026
The honest-stats rename in PR MemPalace#147 changed the keys returned by
Dialect.compression_stats() (ratio -> size_ratio, compressed_chars ->
summary_chars, original_tokens / compressed_tokens ->
original_tokens_est / summary_tokens_est). cmd_compress still reads
the old names, so mempalace compress throws KeyError on the first
drawer it touches and the feature is effectively dead.
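
A minimal sketch of the failure mode described above. The key names come from this commit message; the stats dict is a hypothetical, partial example with made-up values, and the real compression_stats() result may contain additional fields.

```python
# Old -> new key names, as listed in the commit message.
RENAMED_KEYS = {
    "ratio": "size_ratio",
    "compressed_chars": "summary_chars",
    "original_tokens": "original_tokens_est",
    "compressed_tokens": "summary_tokens_est",
}

# Hypothetical post-rename stats dict; values are illustrative only.
stats = {
    "original_tokens_est": 120,
    "summary_tokens_est": 30,
    "size_ratio": 4.0,
    "summary_chars": 140,
}

missing = None
try:
    stats["ratio"]  # what a caller written against the old key names would do
except KeyError as exc:
    missing = exc.args[0]
    print(f"KeyError: {missing!r}; the renamed key is {RENAMED_KEYS[missing]!r}")
```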

Also fix the summary line at the bottom of cmd_compress. It called
count_tokens("x" * total_original), but count_tokens is word-based
(max(1, int(len(text.split()) * 1.3))), and a string of repeated
xs is a single "word", so both totals were always 1. Accumulate
the per-drawer estimates during the main loop instead, and use a
token-based ratio so the summary line is self-consistent with the
per-drawer dry-run output.
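
The summary-line bug and the described fix can be sketched as follows. The count_tokens body uses the formula quoted above; the drawers list and the variable names are hypothetical and do not reflect the actual cmd_compress code.

```python
def count_tokens(text: str) -> int:
    # Word-based estimate, using the formula quoted in the commit message.
    return max(1, int(len(text.split()) * 1.3))


# The bug: a run of "x" contains no whitespace, so it counts as a single word
# no matter how long it is.
total_original_chars = 5000
buggy_total = count_tokens("x" * total_original_chars)

# The fix as described: accumulate per-drawer estimates inside the main loop.
# Each entry is a hypothetical (original_text, summary_text) pair.
drawers = [
    ("alpha beta gamma delta epsilon zeta", "alpha gamma"),
    ("one two three four five six seven eight", "one five eight"),
]

total_original_tokens = 0
total_summary_tokens = 0
for original, summary in drawers:
    total_original_tokens += count_tokens(original)
    total_summary_tokens += count_tokens(summary)

# Token-based ratio, guarded against division by zero.
token_ratio = total_original_tokens / max(total_summary_tokens, 1)
print(buggy_total, total_original_tokens, total_summary_tokens, round(token_ratio, 2))
```

Summing the per-drawer estimates keeps the totals consistent with whatever the per-drawer output printed, since both come from the same calls.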

The storage metadata key names on the compressed collection
(compression_ratio, original_tokens) stay the same for compatibility
with anything already reading them. Only the source of the values
is updated.

Fixes MemPalace#159 (points 1 and 2)
milla-jovovich pushed a commit that referenced this pull request Apr 9, 2026
PyPI release cut covering 39 merged PRs since v3.0.0 on 2026-04-06.
Highlights: Claude/Codex plugin packaging (#270), security hardening (#387),
honest AAAK stats + benchmark corrections (#147), Windows compatibility fixes,
Knowledge Graph WAL mode + batching, 10K limit safety caps, and much more.

See GitHub release notes for full changelog.
@milla-jovovich milla-jovovich mentioned this pull request Apr 9, 2026
6 tasks
mvalentsev added a commit to mvalentsev/mempalace that referenced this pull request Apr 10, 2026
The new test_cmd_compress_dry_run and test_cmd_compress_stores_results
tests (added upstream after this branch diverged) mock
compression_stats() with the old key names. Update the mocks to use
the post-MemPalace#147 keys (original_tokens_est, summary_tokens_est,
size_ratio, summary_chars) so they match what the fixed cmd_compress
actually reads.
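
A minimal sketch of what such a mock might look like with unittest.mock. Only the four key names come from this page; the mocked object, the values, and the arguments passed to compression_stats() are assumptions made for illustration, not the project's actual test code.

```python
from unittest.mock import MagicMock

# Hypothetical stand-in for a Dialect instance.
dialect = MagicMock()
dialect.compression_stats.return_value = {
    "original_tokens_est": 100,
    "summary_tokens_est": 25,
    "size_ratio": 4.0,
    "summary_chars": 120,
}

# The argument list is an assumption; MagicMock accepts any call signature.
stats = dialect.compression_stats("original text", "summary text")
print(stats["size_ratio"], stats["summary_tokens_est"])
```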
@bensig bensig deleted the fix/aaak-honest-stats branch April 10, 2026 16:26
Development

Successfully merging this pull request may close these issues.

Incorrect token estimate for AAAK
