fix: correct token count estimate in compress summary #609
bensig merged 1 commit into MemPalace:develop from
Conversation
web3guru888
left a comment
The root cause is clear: building a string of N identical characters with no whitespace gives count_tokens() a single token every time. The char-to-token estimate (3.8 chars/token) is the right fix — it's a direct, constant-time calculation that matches the underlying heuristic without the allocation bug.
One note for whoever reviews after me: 3.8 chars/token is a reasonable English-prose average but it'll undercount tokens for code or heavily punctuated content (typically closer to 3–3.5 chars/token). The compress command is primarily used on prose summaries, so this is fine for the use case. If token precision ever becomes important here, the fix would be to pass a content_type hint to the estimate rather than using a single global ratio — but that's a separate concern and shouldn't block this PR.
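The content-type hint mentioned above could look something like the sketch below. This is a hypothetical extension, not part of this PR; the function name, the ratio table, and the exact ratios are illustrative assumptions, not MemPalace code.

```python
# Hypothetical per-content-type estimate; not part of the actual fix.
# Ratios are rough assumptions: prose tokenizes at ~3.8 chars/token,
# code and heavily punctuated text closer to 3-3.5.
CHARS_PER_TOKEN = {
    "prose": 3.8,
    "code": 3.2,
}

def estimate_tokens(total_chars: int, content_type: str = "prose") -> int:
    # Fall back to the prose ratio for unknown content types.
    ratio = CHARS_PER_TOKEN.get(content_type, 3.8)
    return max(1, int(total_chars / ratio))
```

The default keeps today's behavior, so callers that never pass a hint are unaffected.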
LGTM.
[MemPalace-AGI integration — 215 tests, 710 KG entities]
…iterdir OSError)
- Expand ~ in split command dir/output-dir args (upstream MemPalace#361)
- Use yield+close in KG test fixture to prevent SQLite leaks (MemPalace#450)
- Use char/3.8 token estimate for consistency (MemPalace#609)
- Handle OSError from iterdir on broken symlinks/reparse points (MemPalace#558)

Closes DOT-488.
Problem

cli.py:389-390 — the compress command summary always prints 1t -> 1t regardless of actual content size. Dialect.count_tokens("x" * total_original) builds a string of N identical characters with no whitespace. count_tokens() splits on spaces, gets 1 word, and returns max(1, int(1 * 1.3)) = 1 every time. Secondary issue: it allocates a potentially huge string ("x" * total_original) for no reason.

Fix

Replace the call with a direct char-to-token estimate using the same underlying heuristic (~1.3 tokens/word, ~3.8 chars/token): no string allocation, correct output.
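A minimal sketch of the failure mode and the replacement estimate. The count_tokens body below follows the description in this PR (split on whitespace, ~1.3 tokens/word) and is illustrative, not the actual MemPalace source.

```python
def count_tokens(text: str) -> int:
    # Splits on whitespace, so a string of N identical characters is a
    # single "word" and always estimates as 1 token.
    words = len(text.split())
    return max(1, int(words * 1.3))

def estimate_tokens(total_chars: int) -> int:
    # Direct char-to-token estimate (~3.8 chars/token), consistent with
    # the ~1.3 tokens/word heuristic for average English prose.
    return max(1, int(total_chars / 3.8))

total_original = 10_000
buggy = count_tokens("x" * total_original)  # allocates a 10k-char string, returns 1
fixed = estimate_tokens(total_original)     # constant-time, returns 2631
```

The fixed path never materializes the throwaway string, so it is O(1) in both time and memory regardless of content size.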
Test plan

- pytest tests/test_cli.py -v — all 40 tests pass
- ruff check + ruff format clean
- mempalace compress --dry-run --wing <wing> on a real palace and verify token counts are reasonable
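A regression test in the spirit of the plan above could pin the fix down directly. This is a hedged sketch: the helper name and test wiring are assumptions, and the real tests in tests/test_cli.py may exercise this through the CLI instead.

```python
# Hypothetical regression test; estimate_tokens stands in for whatever
# helper the compress summary actually calls after this fix.
def estimate_tokens(total_chars: int) -> int:
    return max(1, int(total_chars / 3.8))

def test_compress_summary_token_estimate_scales_with_size():
    small = estimate_tokens(100)
    large = estimate_tokens(100_000)
    # The old code returned 1 for any input; the estimate must grow
    # with content size.
    assert small >= 1
    assert large > small
    assert large == int(100_000 / 3.8)
```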