feat(web): surface fastembed model load state via banner (#696) #703
Conversation
PR #703 review: ``embedding/onnx.py:_resolve_model`` (short-name → fastembed-id) was duplicated by a hand-rolled ``_resolve_fastembed_model_id`` in ``web/routes/system.py``, and the approximate-size map in ``embedding/readiness.py`` was a third copy that drifted from both the wizard text in ``cli/init_cmd.py`` and fastembed's own ``size_in_GB`` metadata.

The reviewer surfaced concrete drift on bge-m3:

* fastembed metadata (and ``add_custom_model(size_in_gb=2.3)``): 2300 MB
* readiness banner copy (correct): 2300 MB
* init wizard text (wrong): 1.2 GB

A user who picked bge-m3 after reading "~1.2 GB" in the wizard would then see "Downloading bge-m3 (~2300 MB)…" in the banner — a ~2× jump. ``bge-small-en-v1.5`` and ``all-MiniLM-L6-v2`` had similar mismatches.

This commit:

* Adds ``embedding/aliases.py`` as the single source of truth for short alias → (fastembed id, dim, MB) plus a separate reranker size table. Sizes match ``TextEmbedding.list_supported_models()`` / ``TextCrossEncoder.list_supported_models()`` exactly. Custom-registered models (just bge-m3 today) carry the size declared on their ``add_custom_model`` call.
* Updates ``embedding/onnx.py`` to import ``resolve_embedder_id`` from aliases instead of carrying its own ``_ONNX_MODELS`` map.
* Drops the duplicate ``_resolve_fastembed_model_id`` and the local ``_APPROX_SIZE_MB`` from ``web/routes/system.py`` and ``embedding/readiness.py``; both now read from aliases.
* Updates the init wizard at ``cli/init_cmd.py`` to render sizes via ``aliases.format_size`` so the user-facing copy and the runtime banner are guaranteed to agree.
* Corrects approximate sizes that were wrong even in the readiness table — bge-small-en-v1.5 67 MB (was 130), nomic-embed-text-v1.5 520 MB (was 280), jina-reranker-v2 1110 MB (was 1100).
* Adds ``tests/test_embedding_aliases.py`` covering both directions of the lookup plus a snapshot test that pins the legacy short-name contract — a future refactor that re-introduces a private alias map fails this test instead of silently drifting.

The remaining minor notes from the review (raw ``_load_error`` in the API response, polling-cap silent stop, blob-completeness probe) are deliberately out of scope here — see the PR thread.

Co-Authored-By: Claude <[email protected]>
Addressed both blocking review issues:

#1 Size discrepancy — used fastembed's own ``size_in_GB`` metadata as the reference. The wizard's "bge-m3 is ~1.2 GB (similar to Ollama models)" copy was also adjusted to "~2.3 GB (substantial download)" — calling 2.3 GB "similar to Ollama" was misleading.

#2 Alias map dedup — added ``embedding/aliases.py`` as the single source of truth, plus a drift-guard snapshot test. Remaining minor notes are deliberately out of scope here.

Tests: 3923 passed (added 15 alias/dedup tests on top of the 3908 baseline).
``OnnxEmbedder`` and ``FastEmbedReranker`` lazily instantiate fastembed models on first use. Until now there was no way to tell from outside the class whether a download was in flight, the model was loaded, or the cache was simply cold — fastembed itself does not surface progress, so the calling layer was blind for the entire 30-second-to-multi-minute window.

Add two observability fields to both lazy loaders:

* ``_loading: bool`` — True between entering ``_get_model()`` and the fastembed constructor returning.
* ``_load_error: str | None`` — last failure message, set on exception inside the constructor and re-raised.

Plain attribute reads/writes; bool/``Optional[str]`` assignment is atomic under CPython, and the upcoming ``GET /api/system/model-readiness`` endpoint is allowed to observe transient states without taking a lock.

Also add ``embedding/readiness.py`` with ``model_snapshot_present()`` — a filesystem-only check for whether a complete fastembed snapshot exists in the cache directory. The function walks ``cache_dir/models--<sanitized>/snapshots/`` and accepts the first subdirectory that contains ``config.json``, ``tokenizer.json``, and either a flat ``model.onnx`` or a nested ``onnx/model.onnx`` (fastembed uses the nested form for ``BAAI/bge-m3`` and the multilingual reranker; the flat form for the smaller English models). A small ``_APPROX_SIZE_MB`` map populated from the documented model list in ``cli/init_cmd.py`` lets the upcoming banner render "Downloading bge-m3 (~2.3 GB)…" without an extra network call.

Co-Authored-By: Claude <[email protected]>
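The flag discipline described above can be sketched as follows. The class and attribute names match the commit; the ``_construct`` hook standing in for the real fastembed constructor call is a hypothetical seam for illustration:

```python
from typing import Any


class OnnxEmbedder:
    """Illustrative sketch of a lazy loader with observability flags."""

    def __init__(self, model_name: str) -> None:
        self._model_name = model_name
        self._model: Any | None = None
        self._loading: bool = False          # True while constructor is in flight
        self._load_error: str | None = None  # last failure message, if any

    def _get_model(self) -> Any:
        if self._model is None:
            self._loading = True
            try:
                # Stand-in for fastembed's TextEmbedding(...) call, which may
                # download a multi-GB snapshot when the cache is cold.
                self._model = self._construct()
                self._load_error = None
            except Exception as exc:
                # Record the failure for the readiness endpoint, then re-raise
                # so the caller still sees the original error.
                self._load_error = str(exc)
                raise
            finally:
                self._loading = False
        return self._model

    def _construct(self) -> Any:
        raise NotImplementedError  # real code constructs the fastembed model here
```

Because the endpoint only reads these attributes (single bool/str assignments, atomic under CPython's GIL), no lock is needed for the observer side.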
Adds a read-only endpoint the SPA can poll to populate a "Downloading
bge-m3 (~2.3 GB)…" / "Loading model…" banner instead of leaving the
user staring at a frozen Search button while a multi-GB fastembed
snapshot streams in.
Response covers both the embedder and the reranker:
```
GET /api/system/model-readiness
→ { embedder: {state, provider, model, cache_present, approx_size_mb, error},
reranker: {state, ...} // state="skipped" when rerank.enabled is False
}
```
State per component, derived from the ``_model`` / ``_loading`` /
``_load_error`` flags introduced in the previous commit plus a
filesystem probe of ``cache_dir/models--<sanitized>/snapshots/<sha>/``:
* ``ready`` — model loaded in memory.
* ``loading`` — cache present, constructor in flight.
* ``downloading`` — cache absent, constructor in flight.
* ``cold`` — nothing in flight (cache may or may not be present).
* ``error`` — last constructor attempt raised.
* ``skipped`` — provider routes through Ollama/Cohere/etc., or the
component is disabled.
Providers introspected through this endpoint are restricted to the
fastembed-backed paths (``"onnx"`` for the embedder, ``"fastembed"``
for the reranker). Ollama and Cohere have their own connection-based
readiness model and are reported as ``skipped`` — wiring them in
deserves a separate decision pass, not a quiet conflation here.
The endpoint never calls ``_get_model()`` itself, so polling it cannot
amplify load on a struggling installation. Cache-presence probes go
through ``model_snapshot_present`` which is filesystem-only.
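A filesystem-only probe of the kind described could look like this. The function name, the ``models--<sanitized>/snapshots/`` layout, and the required-file list all come from the commit text; the exact traversal is an illustrative sketch:

```python
from pathlib import Path


def model_snapshot_present(cache_dir: Path, model_id: str) -> bool:
    """Return True if a complete fastembed snapshot exists on disk.

    Walks cache_dir/models--<sanitized>/snapshots/ and accepts the first
    subdirectory containing config.json, tokenizer.json, and either a flat
    model.onnx or a nested onnx/model.onnx. Never touches the network.
    """
    sanitized = model_id.replace("/", "--")
    snapshots = cache_dir / f"models--{sanitized}" / "snapshots"
    if not snapshots.is_dir():
        return False
    for snap in snapshots.iterdir():
        if not snap.is_dir():
            continue
        has_config = (snap / "config.json").is_file()
        has_tokenizer = (snap / "tokenizer.json").is_file()
        has_onnx = (
            (snap / "model.onnx").is_file()
            or (snap / "onnx" / "model.onnx").is_file()
        )
        if has_config and has_tokenizer and has_onnx:
            return True
    return False
```

Because the probe is pure ``pathlib`` stat calls, polling the endpoint stays cheap even while a download saturates the disk.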
Schema lives in ``web/schemas/config.py`` next to the existing
``EmbeddingStatusResponse`` / ``EmbeddingResetResponse`` so the
embedding-related types stay colocated. UI wiring lands in the next
commit.
Co-Authored-By: Claude <[email protected]>
Surfaces the readiness endpoint added in the previous commit as a header banner so a cold-cache install no longer leaves users staring at a frozen Search button while ``BAAI/bge-m3`` (~2.3 GB) streams in. Closes the user-visible half of #696.

Banner copy is built from ``GET /api/system/model-readiness``:

* Both components downloading → "Downloading bge-m3 (~2300 MB) and jina-reranker (~1100 MB)…"
* One downloading → "Downloading bge-m3 (~2300 MB)…" (or the ``..._no_size`` variant for unknown models)
* Loading from cache, no download → "Loading model…"
* Hard error → "Model failed to load — check Settings."
* Both ready / skipped → banner hidden, polling stops.

Polling uses the same single-flight + setTimeout idiom as ``_indexingPollUntilIdle`` (4-second interval, capped at 200 ticks ≈ 13 min so a stuck server doesn't yield infinite background fetches). Three entry points kick the loop:

1. Boot — ``_modelReadinessHydrate()`` runs from the DOMContentLoaded handler. It fetches once and only starts continuous polling if at least one component is actively loading or has errored.
2. ``visibilitychange`` — re-hydrates when the tab regains focus so a load that finished while backgrounded doesn't leave the banner stuck up.
3. ``doSearch()`` pre-flight — kicks ``_modelReadinessPoll()`` on every search submission. The first tick may race the request and observe ``state="cold"``; ``cold`` is intentionally non-terminal here so the next tick catches the ``_loading=True`` flip on the server side.

Five new ``banner.model_*`` i18n keys land in both ``en.json`` and ``ko.json``, plus two fallback name keys for use when the response omits the model identifier. The CSS reuses the visual language of ``.dev-mode-banner`` (accent-tinted background, single row).

``index.html`` cache busters bumped: ``style.css?v=76→77`` (banner class added) and ``app.js?v=94→96`` (polling logic + ``doSearch`` pre-flight). The ``v=96`` jump leapfrogs an in-flight v=95 from a sibling PR; if that lands first, this rebase will need a bump.

Co-Authored-By: Claude <[email protected]>
Closes #696 (option B).
Summary
* Adds ``GET /api/system/model-readiness`` reporting per-component (embedder + reranker) load state derived from ``_loading``/``_load_error`` flags newly attached to ``OnnxEmbedder`` and ``FastEmbedReranker``, plus a filesystem probe of the fastembed cache.
* Adds a header banner — "Downloading bge-m3 (~2300 MB)…" / "Loading model…" / "Model failed to load — check Settings." — so the first search after a cold-cache boot no longer feels like a hung UI.
* A boot hydrate, a ``visibilitychange`` re-hydrate, and a ``doSearch()`` pre-flight cover the three entry points.

Why option B

``fastembed`` does not expose snapshot-download progress events, so a real percent bar (option C) would mean wrapping ``huggingface_hub.snapshot_download`` ourselves and replicating fastembed's resolution logic. Option B is the smallest change that converts "frozen Search button" into "I see what's happening" without that complexity. See the issue body for the option A/B/C tradeoffs.

Commits
Split for review; one PR for atomic ship:
1. ``feat(embedding): track loading state on lazy fastembed loaders`` — ``_loading``/``_load_error`` flags + ``embedding/readiness.py`` cache probe + size map.
2. ``feat(web): add /api/system/model-readiness endpoint`` — schema, route, endpoint tests. No UI.
3. ``feat(web): show model-loading banner with polling`` — banner DOM + CSS, polling logic, i18n keys, cache busters.
State machine
* Provider not in {``onnx``, ``fastembed``} OR component disabled → ``skipped``
* ``_load_error`` set → ``error``
* ``_model is not None`` → ``ready``
* ``_loading=True`` and cache absent → ``downloading``
* ``_loading=True`` and cache present → ``loading``
* Otherwise → ``cold``

``cold`` is intentionally non-terminal for ``doSearch()``-initiated polls — the first tick can race the request and observe ``cold`` before the backend's lazy loader flips ``_loading=True``. The boot hydrate does treat ``cold`` as terminal so a fully-warm install doesn't run a needless 4 s/tick background loop.

Out of scope (deliberate)
Ollama/Cohere readiness — those providers are reported as ``state="skipped"`` here.

Cache-buster note
``app.js?v=94→96`` and ``style.css?v=76→77``. The ``v=96`` jump leapfrogs an in-flight ``v=95`` from sibling PR #694 (namespace-tooltip). Whichever lands first, the second will need a rebase + bump.
Backend:

* ``uv run pytest packages/memtomem/tests/test_embedding_readiness.py packages/memtomem/tests/test_web_model_readiness.py`` — 20 tests cover the cache probe + state machine.
* ``uv run pytest -m "not ollama"`` — 3908 passed locally.
* ``uv run ruff check packages/memtomem/src && uv run ruff format --check packages/memtomem/src packages/memtomem/tests`` — clean.

Manual UI (cold-cache reproduction):
* Korean toggle: switch language and re-run; banner copy must come from ``ko.json`` strings.

🤖 Generated with Claude Code