refactor: consolidate pagination into shared iter_metadatas() (#171)#482
refactor: consolidate pagination into shared iter_metadatas() (#171)#482Formatted wants to merge 6 commits intoMemPalace:developfrom
Conversation
84438f6 to
1d77784
Compare
web3guru888
left a comment
There was a problem hiding this comment.
This fixes a real pain point. We hit the 10k truncation on our integration (710+ KG entities, growing) and had to work around it.
_iter_metadatas generator — Clean pattern. The batch-of-500 with early termination on short page is correct. Good defensive handling of None metadatas — we've seen ChromaDB return that on freshly-created collections.
partial flag in status() — Smart addition. Having the response self-report when it couldn't scan all metadata is useful for clients. One thought: when partial is true, would it also make sense to include total_from_meta in the response so the caller can see how many were actually scanned vs total_drawers?
Miner-side pagination — The parallel fix in miner.py is necessary too. Only note: unlike the MCP side which uses the shared _iter_metadatas helper, the miner reimplements its own pagination loop with _PAGE = 500. Consider extracting a shared utility to avoid the two loops drifting apart — but that's a follow-up, not a blocker for this PR.
Tests — The _TAXONOMY_BATCH=2 monkeypatch to force multi-page iteration on a 4-item seeded collection is clever. The None metadatas edge case test is particularly valuable.
Strong +1. This is the kind of fix that prevents silent data loss at scale.
🔭 Reviewed as part of the MemPalace-AGI integration project — autonomous research with perfect memory. Community interaction updates are posted regularly on the dashboard.
|
Good call on the shared utility — went ahead and did the refactor. Extracted the pagination loop into |
web3guru888
left a comment
There was a problem hiding this comment.
This is exactly the right fix. We were running with the 10k truncation bug for months before we caught it — wrong wing/room counts in mempalace status that were just plausible enough to not trigger suspicion.
The pagination approach (offset loop in batches) is consistent with what palace_graph.py already does; good that you matched the pattern rather than introducing a new one.
One observation: mcp_server.py tool_status, tool_list_wings, tool_list_rooms, and tool_get_taxonomy all have the same uncapped col.get(limit=10000) call — tracked in #478. This PR handles the CLI/miner path; those four MCP tools are still outstanding. Might be worth a follow-up PR or a note in this one to make the scope explicit.
Tests look appropriate. LGTM for the miner.py pagination path.
62c984a to
1ba5ab8
Compare
|
The MCP tools are covered — #493 fixes all four ( |
|
Got it — good to know #493 covers the MCP side. The split makes sense: fixing the CLI path here and the MCP tools in a focused PR there is cleaner than trying to do it all in one. Both can land independently. |
|
Scope has expanded since that comment. After the discussion on #478, went ahead and did the full consolidation: removed |
e9c0393 to
48217c9
Compare
|
That's a cleaner outcome than the scoped fix — single Does this now supersede the pagination work in PR #563 (the |
|
Looked at #563 — they conflict rather than complement. #563 replaces |
|
Got it — #482 already removed those lines in favour of |
48217c9 to
0e6500a
Compare
|
@milla-jovovich @bensig — rebased and conflict-free. Quick summary of where this stands:
|
- add partial flag to tool_status when metadata count diverges from col.count() - fix `if where` → `if where is not None` in _iter_metadatas - add TestTaxonomyPagination covering pagination for all 4 tools, partial flag, and tool_list_rooms with where filter Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…lace#171) Replaces the single col.get(limit=10000) call in status() with a 500-item pagination loop, matching the fix applied to MCP taxonomy tools in MemPalace#307. Palaces with >10k drawers now report correct drawer counts instead of silently truncating at 10k.
Moves the pagination loop from miner.status() into palace.py alongside get_collection and file_already_mined, so mcp_server.py can adopt the same helper instead of maintaining a parallel implementation.
Removes the private _iter_metadatas from mcp_server.py and the inline offset loop from palace_graph.py — both now use the shared iter_metadatas() helper in palace.py. Single implementation, single place to tune batch size or fix edge cases.
0e6500a to
c4d4d7a
Compare
|
Rebased again onto upstream/main after PR #371 landed — that PR added inline pagination loops to |
|
Closing — this is superseded by recently merged PRs to develop. Thank you for the contribution! |
What does this PR do?
Started as a CLI-path fix for the 10k truncation bug in
miner.status()(#171), then evolved — per reviewer suggestion — into a codebase-wide consolidation.Before: three separate pagination implementations:
mcp_server.py— private_iter_metadatas()generatorpalace_graph.py— inlinewhile offset < totalloop withcol.get(limit=1000)miner.py— singlecol.get(limit=10000)call (the original bug)After: one shared
iter_metadatas(col, where=None, batch=500)generator inpalace.py, alongsideget_collectionandfile_already_mined. All three callers import and use it. Single place to tune batch size or fix edge cases going forward.How to test
pip install -e ".[dev]" python -m pytest tests/test_miner.py::test_status_paginates_beyond_single_page -v python -m pytest tests/ -vThe new test mocks a 1100-drawer palace across three pages (500 + 500 + 100) and asserts the correct total is printed and that
col.getwas called three times with advancing offsets.Checklist
python -m pytest tests/ -v)ruff check .)Related