fix: correct token count estimate in compress summary #609
bensig merged 1 commit into MemPalace:develop from
Conversation
web3guru888
left a comment
The root cause is clear: building a string of N identical characters with no whitespace gives count_tokens() a single token every time. The char-to-token estimate (3.8 chars/token) is the right fix — it's a direct, constant-time calculation that matches the underlying heuristic without the allocation bug.
One note for whoever reviews after me: 3.8 chars/token is a reasonable English-prose average but it'll undercount tokens for code or heavily punctuated content (typically closer to 3–3.5 chars/token). The compress command is primarily used on prose summaries, so this is fine for the use case. If token precision ever becomes important here, the fix would be to pass a content_type hint to the estimate rather than using a single global ratio — but that's a separate concern and shouldn't block this PR.
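The content-type hint mentioned above could look something like the sketch below. This is a hypothetical extension, not part of this PR; the function name, the ratio table, and the exact ratios are illustrative assumptions, not MemPalace code.

```python
# Hypothetical per-content-type estimate; not part of the actual fix.
# Ratios are rough assumptions: prose tokenizes at ~3.8 chars/token,
# code and heavily punctuated text closer to 3-3.5.
CHARS_PER_TOKEN = {
    "prose": 3.8,
    "code": 3.2,
}

def estimate_tokens(total_chars: int, content_type: str = "prose") -> int:
    # Fall back to the prose ratio for unknown content types.
    ratio = CHARS_PER_TOKEN.get(content_type, 3.8)
    return max(1, int(total_chars / ratio))
```

The default keeps today's behavior, so callers that never pass a hint are unaffected.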
LGTM.
[MemPalace-AGI integration — 215 tests, 710 KG entities]
…iterdir OSError)
- Expand ~ in split command dir/output-dir args (upstream MemPalace#361)
- Use yield+close in KG test fixture to prevent SQLite leaks (MemPalace#450)
- Use char/3.8 token estimate for consistency (MemPalace#609)
- Handle OSError from iterdir on broken symlinks/reparse points (MemPalace#558)

Closes DOT-488.
Problem

cli.py:389-390 — the compress command summary always prints 1t -> 1t regardless of actual content size. Dialect.count_tokens("x" * total_original) builds a string of N identical characters with no whitespace. count_tokens() splits on spaces, gets 1 word, and returns max(1, int(1 * 1.3)) = 1 every time. Secondary issue: it allocates a potentially huge string ("x" * total_original) for no reason.

Fix

Replace the call with a direct char-to-token estimate using the same underlying heuristic (~1.3 tokens/word, ~3.8 chars/token): no string allocation, correct output.
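A minimal sketch of the failure mode and the replacement estimate. The count_tokens body below follows the description in this PR (split on whitespace, ~1.3 tokens/word) and is illustrative, not the actual MemPalace source.

```python
def count_tokens(text: str) -> int:
    # Splits on whitespace, so a string of N identical characters is a
    # single "word" and always estimates as 1 token.
    words = len(text.split())
    return max(1, int(words * 1.3))

def estimate_tokens(total_chars: int) -> int:
    # Direct char-to-token estimate (~3.8 chars/token), consistent with
    # the ~1.3 tokens/word heuristic for average English prose.
    return max(1, int(total_chars / 3.8))

total_original = 10_000
buggy = count_tokens("x" * total_original)  # allocates a 10k-char string, returns 1
fixed = estimate_tokens(total_original)     # constant-time, returns 2631
```

The fixed path never materializes the throwaway string, so it is O(1) in both time and memory regardless of content size.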
Test plan

- pytest tests/test_cli.py -v — all 40 tests pass
- ruff check + ruff format clean
- mempalace compress --dry-run --wing <wing> on a real palace and verify token counts are reasonable
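A regression test in the spirit of the plan above could pin the fix down directly. This is a hedged sketch: the helper name and test wiring are assumptions, and the real tests in tests/test_cli.py may exercise this through the CLI instead.

```python
# Hypothetical regression test; estimate_tokens stands in for whatever
# helper the compress summary actually calls after this fix.
def estimate_tokens(total_chars: int) -> int:
    return max(1, int(total_chars / 3.8))

def test_compress_summary_token_estimate_scales_with_size():
    small = estimate_tokens(100)
    large = estimate_tokens(100_000)
    # The old code returned 1 for any input; the estimate must grow
    # with content size.
    assert small >= 1
    assert large > small
    assert large == int(100_000 / 3.8)
```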