feat(web): surface fastembed model load state via banner (#696) #703

Merged

memtomem merged 4 commits into main from feat/model-readiness-banner on May 2, 2026
Conversation

memtomem (Owner) commented May 2, 2026

Closes #696 (option B).

Summary

  • Adds GET /api/system/model-readiness reporting per-component (embedder + reranker) load state derived from _loading / _load_error flags newly attached to OnnxEmbedder and FastEmbedReranker, plus a filesystem probe of the fastembed cache.
  • Wires a header banner that polls the endpoint and renders Downloading bge-m3 (~2300 MB)… / Loading model… / Model failed to load — check Settings so the first search after a cold-cache boot no longer feels like a hung UI.
  • Boot hydrate, visibilitychange re-hydrate, and a doSearch() pre-flight cover the three entry points.

Why option B

fastembed does not expose snapshot-download progress events, so a real percent bar (option C) would mean wrapping huggingface_hub.snapshot_download ourselves and replicating fastembed's resolution logic. Option B is the smallest change that converts "frozen Search button" into "I see what's happening" without that complexity. See the issue body for the option A/B/C tradeoffs.

Commits

Split for review; one PR for atomic ship:

  1. feat(embedding): track loading state on lazy fastembed loaders
    – _loading / _load_error flags + embedding/readiness.py cache probe + size map.
  2. feat(web): add /api/system/model-readiness endpoint
    – Schema, route, endpoint tests. No UI.
  3. feat(web): show model-loading banner with polling
    – Banner DOM + CSS, polling logic, i18n keys, cache busters.

State machine

| condition | state |
| --- | --- |
| provider not in {onnx, fastembed} OR component disabled | skipped |
| _load_error set | error |
| _model is not None | ready |
| _loading=True and cache absent | downloading |
| _loading=True and cache present | loading |
| otherwise | cold |

cold is intentionally non-terminal for doSearch()-initiated polls — the first tick can race the request and observe cold before the backend's lazy loader flips _loading=True. The boot hydrate does treat cold as terminal so a fully-warm install doesn't run a needless 4 s/tick background loop.

Out of scope (deliberate)

Cache-buster note

app.js?v=94→96 and style.css?v=76→77. The v=96 jump leapfrogs an in-flight v=95 from sibling PR #694 (namespace-tooltip). Whichever lands first, the second will need a rebase + bump.

Test plan

Backend:

  • uv run pytest packages/memtomem/tests/test_embedding_readiness.py packages/memtomem/tests/test_web_model_readiness.py — 20 tests cover the cache probe + state machine.
  • uv run pytest -m "not ollama" — 3908 passed locally.
  • uv run ruff check packages/memtomem/src && uv run ruff format --check packages/memtomem/src packages/memtomem/tests clean.

Manual UI (cold-cache reproduction):

mv ~/.memtomem/cache/fastembed ~/.memtomem/cache/fastembed.bak
uv run mm web --host 127.0.0.1 --port 8765 &
# Open the URL — banner is hidden because no search is in flight.
# Trigger a search — banner appears as "Downloading bge-m3 (~2300 MB)…"
# (assuming the user's config selects bge-m3). Banner persists through
# the load and disappears on completion. Search results render.
kill %1
mv ~/.memtomem/cache/fastembed.bak ~/.memtomem/cache/fastembed

Korean toggle: switch language and re-run; banner copy must come from ko.json strings.

🤖 Generated with Claude Code

memtomem pushed a commit that referenced this pull request May 2, 2026
PR #703 review: ``embedding/onnx.py:_resolve_model`` (short-name →
fastembed-id) was duplicated by a hand-rolled
``_resolve_fastembed_model_id`` in ``web/routes/system.py``, and the
approximate-size map in ``embedding/readiness.py`` lived a third copy
that drifted from both the wizard text in ``cli/init_cmd.py`` and
fastembed's own ``size_in_GB`` metadata.

Reviewer surfaced concrete drift on bge-m3:

* fastembed metadata (and ``add_custom_model(size_in_gb=2.3)``): 2300 MB
* readiness banner copy (correct):                                2300 MB
* init wizard text (wrong):                                       1.2 GB

A user who picked bge-m3 after reading "~1.2 GB" in the wizard would
then see "Downloading bge-m3 (~2300 MB)…" in the banner — a ~2× jump.
``bge-small-en-v1.5`` and ``all-MiniLM-L6-v2`` had similar mismatches.

This commit:

* Adds ``embedding/aliases.py`` as the single source of truth for short
  alias → (fastembed id, dim, MB) plus a separate reranker size table.
  Sizes match ``TextEmbedding.list_supported_models()`` /
  ``TextCrossEncoder.list_supported_models()`` exactly. Custom-
  registered models (just bge-m3 today) carry the size declared on
  their ``add_custom_model`` call.
* Updates ``embedding/onnx.py`` to import ``resolve_embedder_id`` from
  aliases instead of carrying its own ``_ONNX_MODELS`` map.
* Drops the duplicate ``_resolve_fastembed_model_id`` and the local
  ``_APPROX_SIZE_MB`` from ``web/routes/system.py`` and
  ``embedding/readiness.py``; both now read from aliases.
* Updates the init wizard at ``cli/init_cmd.py`` to render sizes via
  ``aliases.format_size`` so the user-facing copy and the runtime
  banner are guaranteed to agree.
* Corrects approximate sizes that were wrong even in the readiness
  table — bge-small-en-v1.5 67 MB (was 130), nomic-embed-text-v1.5
  520 MB (was 280), jina-reranker-v2 1110 MB (was 1100).
* Adds ``tests/test_embedding_aliases.py`` covering both directions
  of the lookup plus a snapshot test that pins the legacy short-name
  contract — a future refactor that re-introduces a private alias map
  fails this test instead of silently drifting.

The remaining minor notes from the review (raw ``_load_error`` in API
response, polling-cap silent stop, blob-completeness probe) are
deliberately out of scope here — see the PR thread.

Co-Authored-By: Claude <[email protected]>
memtomem (Owner, Author) commented May 2, 2026

Addressed both blocking review issues:

#1 Size discrepancy — Used fastembed's list_supported_models() size_in_GB (and add_custom_model(size_in_gb=2.3) for bge-m3) as the source of truth. Corrected several wrong numbers in both directions:

| model | before | after |
| --- | --- | --- |
| BAAI/bge-m3 (init wizard) | 1.2 GB | 2.3 GB |
| BAAI/bge-small-en-v1.5 (init wizard) | 33 MB | 67 MB |
| BAAI/bge-small-en-v1.5 (readiness map) | 130 MB | 67 MB |
| sentence-transformers/all-MiniLM-L6-v2 (init wizard) | 22 MB | 90 MB |
| nomic-ai/nomic-embed-text-v1.5 (readiness map) | 280 MB | 520 MB |
| jinaai/jina-reranker-v2-base-multilingual (readiness map) | 1100 MB | 1110 MB |

The wizard's "bge-m3 is ~1.2 GB (similar to Ollama models)" copy was also adjusted to "~2.3 GB (substantial download)" — calling 2.3 GB "similar to Ollama" was misleading.

#2 Alias map dedup — Added embedding/aliases.py as the single source of truth. OnnxEmbedder._resolve_model, _resolve_fastembed_model_id in routes, and the _APPROX_SIZE_MB map in readiness.py all gone. Init wizard reads sizes via aliases.format_size so wizard text and banner agree by construction.

Drift guard: tests/test_embedding_aliases.py::test_resolve_embedder_id_matches_legacy_resolver snapshots the short-name contract — a future refactor that reintroduces a private duplicate fails the test instead of silently drifting.
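A single-source alias module of the kind described might look roughly like this. This is a sketch under assumptions: the tuple layout and the `resolve_embedder_id` / `format_size` signatures are guesses at the shape, not the merged code, and only two illustrative entries are shown.

```python
# Sketch of a single-source alias table (entries and layout are illustrative).
_EMBEDDERS: dict[str, tuple[str, int, int]] = {
    # short alias -> (fastembed model id, embedding dim, approx size in MB)
    "bge-m3": ("BAAI/bge-m3", 1024, 2300),
    "bge-small-en-v1.5": ("BAAI/bge-small-en-v1.5", 384, 67),
}


def resolve_embedder_id(alias: str) -> str:
    """Map a short alias to its fastembed id; pass unknown ids through as-is."""
    entry = _EMBEDDERS.get(alias)
    return entry[0] if entry else alias


def format_size(mb: int) -> str:
    """Render a size the same way everywhere, so wizard copy and banner agree."""
    return f"~{mb / 1000:.1f} GB" if mb >= 1000 else f"~{mb} MB"
```

Routing every consumer (wizard, route, readiness probe) through one table is what makes "agree by construction" hold: there is no second copy left to drift.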

Remaining minor notes are deliberately out of scope here:

  • Raw _load_error in the API response — local-only tool, low risk; would muddy the schema if I started truncating.
  • Polling cap silent stop at 13 min — the cap is already a worst-case safeguard (visibilitychange re-arms); bumping it without changing behavior felt like noise.
  • Cache-presence probe doesn't verify blob completeness — the brief loading-instead-of-downloading UX wobble during the actual download isn't worth complicating the probe.

Tests: 3923 passed (added 15 alias/dedup tests on top of the 3908 baseline). ruff check + format --check clean.

pandas-studio and others added 4 commits May 2, 2026 13:10
``OnnxEmbedder`` and ``FastEmbedReranker`` lazily instantiate fastembed
models on first use. Until now there was no way to tell from outside the
class whether a download was in flight, the model was loaded, or the
cache was simply cold — fastembed itself does not surface progress, so
the calling layer was blind for the entire 30-second-to-multi-minute
window.

Add two observability fields to both lazy loaders:

* ``_loading: bool`` — True between entering ``_get_model()`` and the
  fastembed constructor returning.
* ``_load_error: str | None`` — last failure message, set on exception
  inside the constructor and re-raised.

Plain attribute reads/writes; bool/Optional[str] assignment is atomic
under CPython, and the upcoming ``GET /api/system/model-readiness``
endpoint is allowed to observe transient states without taking a lock.
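The two flags can be sketched on a minimal lazy loader. This is illustrative only; the real `OnnxEmbedder._get_model()` body and constructor arguments differ:

```python
class LazyLoader:
    """Minimal sketch of the lazy-load pattern with observability flags."""

    def __init__(self) -> None:
        self._model = None
        self._loading: bool = False
        self._load_error: str | None = None

    def _construct(self):
        # Stand-in for the fastembed constructor (may download, then load).
        raise NotImplementedError

    def _get_model(self):
        if self._model is None:
            self._loading = True
            self._load_error = None
            try:
                self._model = self._construct()
            except Exception as exc:
                # Record the failure for external observers, then re-raise.
                self._load_error = str(exc)
                raise
            finally:
                # Cleared on both success and failure paths.
                self._loading = False
        return self._model
```

The `finally` block is the important part: `_loading` never stays stuck at True after a failed attempt, so the readiness endpoint reports `error` rather than a phantom in-flight load.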

Also add ``embedding/readiness.py`` with ``model_snapshot_present()`` —
a filesystem-only check for whether a complete fastembed snapshot exists
in the cache directory. The function walks
``cache_dir/models--<sanitized>/snapshots/`` and accepts the first
subdirectory that contains ``config.json``, ``tokenizer.json``, and
either flat ``model.onnx`` or nested ``onnx/model.onnx`` (fastembed uses
the nested form for ``BAAI/bge-m3`` and the multilingual reranker; the
flat form for the smaller English models).

A small ``_APPROX_SIZE_MB`` map populated from the documented model list
in ``cli/init_cmd.py`` lets the upcoming banner render
"Downloading bge-m3 (~2.3 GB)…" without an extra network call.

Co-Authored-By: Claude <[email protected]>
Adds a read-only endpoint the SPA can poll to populate a "Downloading
bge-m3 (~2.3 GB)…" / "Loading model…" banner instead of leaving the
user staring at a frozen Search button while a multi-GB fastembed
snapshot streams in.

Response covers both the embedder and the reranker:

```
GET /api/system/model-readiness
  → { embedder: {state, provider, model, cache_present, approx_size_mb, error},
      reranker: {state, ...}  // state="skipped" when rerank.enabled is False
    }
```

State per component, derived from the ``_model`` / ``_loading`` /
``_load_error`` flags introduced in the previous commit plus a
filesystem probe of ``cache_dir/models--<sanitized>/snapshots/<sha>/``:

* ``ready``       — model loaded in memory.
* ``loading``     — cache present, constructor in flight.
* ``downloading`` — cache absent, constructor in flight.
* ``cold``        — nothing in flight (cache may or may not be present).
* ``error``       — last constructor attempt raised.
* ``skipped``     — provider routes through Ollama/Cohere/etc., or the
                    component is disabled.

Providers introspected through this endpoint are restricted to the
fastembed-backed paths (``"onnx"`` for the embedder, ``"fastembed"``
for the reranker). Ollama and Cohere have their own connection-based
readiness model and are reported as ``skipped`` — wiring them in
deserves a separate decision pass, not a quiet conflation here.

The endpoint never calls ``_get_model()`` itself, so polling it cannot
amplify load on a struggling installation. Cache-presence probes go
through ``model_snapshot_present`` which is filesystem-only.

Schema lives in ``web/schemas/config.py`` next to the existing
``EmbeddingStatusResponse`` / ``EmbeddingResetResponse`` so the
embedding-related types stay colocated. UI wiring lands in the next
commit.

Co-Authored-By: Claude <[email protected]>
Surfaces the readiness endpoint added in the previous commit as a
header banner so a cold-cache install no longer leaves users staring
at a frozen Search button while ``BAAI/bge-m3`` (~2.3 GB) streams in.
Closes the user-visible half of #696.

Banner copy is built from ``GET /api/system/model-readiness``:

* Both components downloading → "Downloading bge-m3 (~2300 MB) and
  jina-reranker (~1100 MB)…"
* One downloading → "Downloading bge-m3 (~2300 MB)…" (or the
  ``..._no_size`` variant for unknown models)
* Loading from cache, no download → "Loading model…"
* Hard error → "Model failed to load — check Settings."
* Both ready / skipped → banner hidden, polling stops.

Polling uses the same single-flight + setTimeout idiom as
``_indexingPollUntilIdle`` (4-second interval, capped at 200 ticks ≈
13 min so a stuck server doesn't yield infinite background fetches).

Three entry points kick the loop:

1. Boot — ``_modelReadinessHydrate()`` runs from the DOMContentLoaded
   handler. Fetches once; only starts continuous polling if at least
   one component is actively loading or has errored.
2. ``visibilitychange`` — re-hydrates when the tab regains focus so a
   load that finished while backgrounded doesn't leave the banner
   stuck up.
3. ``doSearch()`` pre-flight — kicks ``_modelReadinessPoll()`` on
   every search submission. The first tick may race the request and
   observe ``state="cold"``; ``cold`` is intentionally non-terminal
   here so the next tick catches the ``_loading=True`` flip on the
   server side.

Five new ``banner.model_*`` i18n keys land in both ``en.json`` and
``ko.json`` plus two fallback name keys for use when the response
omits the model identifier. The CSS reuses the visual language of
``.dev-mode-banner`` (accent-tinted background, single-row).

``index.html`` cache busters bumped: ``style.css?v=76→77`` (banner
class added) and ``app.js?v=94→96`` (polling logic + ``doSearch``
pre-flight). The ``v=96`` jump leapfrogs an in-flight v=95 from a
sibling PR; if that lands first, rebase will need a bump.

Co-Authored-By: Claude <[email protected]>
@memtomem memtomem force-pushed the feat/model-readiness-banner branch from 3090824 to aab7e3f Compare May 2, 2026 04:10
@memtomem memtomem merged commit b2b24a3 into main May 2, 2026
8 of 9 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators May 2, 2026

Development

Successfully merging this pull request may close these issues.

First search is silent for 30s+ while embedding/reranker models download — surface progress
