
refactor: consolidate pagination into shared iter_metadatas() (#171)#482

Closed
Formatted wants to merge 6 commits into MemPalace:develop from Formatted:fix/171-miner-status-pagination

Conversation

@Formatted commented Apr 10, 2026

What does this PR do?

Started as a CLI-path fix for the 10k truncation bug in miner.status() (#171), then evolved — per reviewer suggestion — into a codebase-wide consolidation.

Before: three separate pagination implementations:

  • mcp_server.py — private _iter_metadatas() generator
  • palace_graph.py — inline while offset < total loop with col.get(limit=1000)
  • miner.py — single col.get(limit=10000) call (the original bug)

After: one shared iter_metadatas(col, where=None, batch=500) generator in palace.py, alongside get_collection and file_already_mined. All three callers import and use it. Single place to tune batch size or fix edge cases going forward.
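A minimal sketch of what the shared helper might look like. Only the signature iter_metadatas(col, where=None, batch=500) and the None-metadatas handling are taken from this PR's discussion; the body is illustrative, assuming ChromaDB's standard col.get(limit=, offset=, where=, include=) paging:

```python
def iter_metadatas(col, where=None, batch=500):
    """Yield every metadata dict from a ChromaDB collection, one page at a time.

    Sketch based on the signature described in this PR; internals are assumed.
    """
    offset = 0
    while True:
        kwargs = {"limit": batch, "offset": offset, "include": ["metadatas"]}
        if where is not None:  # only forward a filter when one was given
            kwargs["where"] = where
        page = col.get(**kwargs)
        # ChromaDB can return None metadatas on freshly-created collections
        metadatas = page.get("metadatas") or []
        yield from metadatas
        if len(metadatas) < batch:  # short page means we just read the last one
            break
        offset += batch
```

Callers then never see pagination at all: `total = sum(1 for _ in iter_metadatas(col))`.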

How to test

pip install -e ".[dev]"
python -m pytest tests/test_miner.py::test_status_paginates_beyond_single_page -v
python -m pytest tests/ -v

The new test mocks a 1100-drawer palace across three pages (500 + 500 + 100) and asserts the correct total is printed and that col.get was called three times with advancing offsets.

Checklist

  • Tests pass (python -m pytest tests/ -v)
  • No hardcoded paths
  • Linter passes (ruff check .)

Related

@Formatted force-pushed the fix/171-miner-status-pagination branch from 84438f6 to 1d77784 on April 10, 2026 02:15

@web3guru888 left a comment


This fixes a real pain point. We hit the 10k truncation on our integration (710+ KG entities, growing) and had to work around it.

_iter_metadatas generator — Clean pattern. The batch-of-500 with early termination on short page is correct. Good defensive handling of None metadatas — we've seen ChromaDB return that on freshly-created collections.

partial flag in status() — Smart addition. Having the response self-report when it couldn't scan all metadata is useful for clients. One thought: when partial is true, would it also make sense to include total_from_meta in the response so the caller can see how many were actually scanned vs total_drawers?
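Concretely, the suggested response could look like this; only the partial flag exists in the PR, total_from_meta is the proposed addition, and the other field names are hypothetical:

```python
# Hypothetical status() payload when the metadata scan comes up short.
status_response = {
    "total_drawers": 12_000,    # from col.count()
    "total_from_meta": 10_000,  # proposed: how many metadatas were actually scanned
    "partial": True,            # scan did not cover every drawer
}

# The flag is just the comparison between the two counts.
assert status_response["partial"] == (
    status_response["total_from_meta"] < status_response["total_drawers"]
)
```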

Miner-side pagination — The parallel fix in miner.py is necessary too. Only note: unlike the MCP side which uses the shared _iter_metadatas helper, the miner reimplements its own pagination loop with _PAGE = 500. Consider extracting a shared utility to avoid the two loops drifting apart — but that's a follow-up, not a blocker for this PR.

Tests — The _TAXONOMY_BATCH=2 monkeypatch to force multi-page iteration on a 4-item seeded collection is clever. The None metadatas edge case test is particularly valuable.

Strong +1. This is the kind of fix that prevents silent data loss at scale.


@Formatted (Author)

Good call on the shared utility — went ahead and did the refactor. Extracted the pagination loop into palace.py as iter_metadatas(col, where=None, batch=500), alongside get_collection and file_already_mined (that module's own docstring already calls it out as the home for shared ChromaDB patterns). miner.status() now just iterates iter_metadatas(col) — the inline loop is gone. mcp_server.py can swap its own _iter_all_metadatas for the shared one once #307 merges.


@web3guru888 left a comment


This is exactly the right fix. We were running with the 10k truncation bug for months before we caught it — wrong wing/room counts in mempalace status that were just plausible enough to not trigger suspicion.

The pagination approach (offset loop in batches) is consistent with what palace_graph.py already does; good that you matched the pattern rather than introducing a new one.

One observation: mcp_server.py tool_status, tool_list_wings, tool_list_rooms, and tool_get_taxonomy all have the same uncapped col.get(limit=10000) call — tracked in #478. This PR handles the CLI/miner path; those four MCP tools are still outstanding. Might be worth a follow-up PR or a note in this one to make the scope explicit.

Tests look appropriate. LGTM for the miner.py pagination path.

@Formatted force-pushed the fix/171-miner-status-pagination branch from 62c984a to 1ba5ab8 on April 10, 2026 05:49
@Formatted (Author)

The MCP tools are covered — #493 fixes all four (tool_status, tool_list_wings, tool_list_rooms, tool_get_taxonomy) using the same paginated offset pattern. This PR stays scoped to the CLI/miner path.

@web3guru888

Got it — good to know #493 covers the MCP side. The split makes sense: fixing the CLI path here and the MCP tools in a focused PR there is cleaner than trying to do it all in one. Both can land independently.

@Formatted changed the title from fix(miner): paginate status() to avoid 10k ChromaDB truncation (#171) to refactor: consolidate pagination into shared iter_metadatas() (#171) on Apr 10, 2026
@Formatted (Author)

Scope has expanded since that comment. After the discussion on #478, went ahead and did the full consolidation: removed _iter_metadatas from mcp_server.py and the inline loop from palace_graph.py — both now use iter_metadatas() from palace.py. Single implementation across the whole codebase.

@web3guru888

That's a cleaner outcome than the scoped fix — single iter_metadatas() implementation across the whole codebase means there's exactly one place to update when ChromaDB changes its pagination API again (and it will). Good call on expanding scope after the #478 discussion.

Does this now supersede the pagination work in PR #563 (the col.count() > 10000 dynamic limit fix), or are they complementary? Asking because both seem to be addressing the 10K wall from different angles — #563 fixes the limit query, this PR fixes the shared implementation. Might be worth coordinating to avoid merge conflicts.

@Formatted (Author)

Looked at #563 — they conflict rather than complement. #563 replaces limit=10000 with limit=max(col.count(), 1) in mcp_server.py, which is a single unbounded fetch that scales with palace size. Our PR already removed those lines in favour of iter_metadatas(). Beyond the merge conflict, the max(col.count(), 1) approach trades one problem for another — on a 125k-drawer palace it fetches all 125k records in a single col.get() call, which is the kind of memory spike the pagination is meant to avoid.
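The working-set difference can be sketched with illustrative fakes (not the actual mcp_server.py code); the first function mirrors #563's single-fetch pattern as described above, the second the paginated pattern:

```python
def fetch_all_at_once(col):
    # #563's approach: one unbounded get() whose result buffer grows
    # linearly with palace size (125k drawers -> 125k records in memory).
    return col.get(limit=max(col.count(), 1), include=["metadatas"])["metadatas"]


def count_paginated(col, batch=500):
    # The pagination approach: at most `batch` records held per step.
    total, offset = 0, 0
    while True:
        page = col.get(limit=batch, offset=offset, include=["metadatas"])
        metas = page.get("metadatas") or []
        total += len(metas)
        if len(metas) < batch:
            return total
        offset += batch
```

Both return the same count; only the paginated version keeps peak memory bounded by the batch size.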

@web3guru888

Got it — #482 already removed those lines in favour of iter_metadatas(), so #563's max(col.count(), 1) patch is patching code that doesn't exist in this branch. The pagination approach is strictly better for large palaces anyway; the single-fetch fix just moves the memory wall up rather than removing it.

@Formatted (Author)

@milla-jovovich @bensig — rebased and conflict-free. Quick summary of where this stands:

Formatted and others added 6 commits April 11, 2026 08:58
- add partial flag to tool_status when metadata count diverges from col.count()
- fix `if where` → `if where is not None` in _iter_metadatas
- add TestTaxonomyPagination covering pagination for all 4 tools, partial flag, and tool_list_rooms with where filter

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…lace#171)

Replaces the single col.get(limit=10000) call in status() with a
500-item pagination loop, matching the fix applied to MCP taxonomy
tools in MemPalace#307. Palaces with >10k drawers now report correct drawer
counts instead of silently truncating at 10k.

Moves the pagination loop from miner.status() into palace.py alongside
get_collection and file_already_mined, so mcp_server.py can adopt the
same helper instead of maintaining a parallel implementation.

Removes the private _iter_metadatas from mcp_server.py and the inline
offset loop from palace_graph.py — both now use the shared
iter_metadatas() helper in palace.py. Single implementation, single
place to tune batch size or fix edge cases.
@Formatted force-pushed the fix/171-miner-status-pagination branch from 0e6500a to c4d4d7a on April 11, 2026 15:00
@Formatted (Author)

Rebased again onto upstream/main after PR #371 landed — that PR added inline pagination loops to mcp_server.py which conflicted with our iter_metadatas() consolidation. Resolved in favour of the shared utility, incorporating the "partial": true response key from #371's tool_status implementation. Conflicts clear, 590 tests passing.

@bensig changed the base branch from main to develop on April 11, 2026 22:22
@bensig requested a review from igorls as a code owner on April 11, 2026 22:22
@bensig (Collaborator) commented Apr 12, 2026

Closing — this is superseded by recently merged PRs to develop. Thank you for the contribution!

@bensig closed this Apr 12, 2026
@Formatted deleted the fix/171-miner-status-pagination branch on April 12, 2026 13:01


Successfully merging this pull request may close these issues.

MCP server: mempalace_list_wings and mempalace_get_taxonomy return empty despite data in palace

3 participants