What
The first search after mm web boot can take 30s to several minutes with no
UI feedback while fastembed downloads and loads the configured ONNX model
(and reranker, if enabled). The Search button shows the standard spinner;
the user can't tell whether the request is hung, the model is loading from
cache, or a multi-GB download is in progress.
The issue is most painful for users who selected the multilingual stack at
mm init time:
- BAAI/bge-m3 — ~2.3 GB ONNX bundle.
- jinaai/jina-reranker-v2-base-multilingual — ~1.1 GB.
These are explicit choices in the mm init wizard
(packages/memtomem/src/memtomem/cli/init_cmd.py:441,615), but the
download then happens silently on the first search — minutes after the
choice was made — so users often think the UI is broken and reload or kill
the server.
Where it bites
- OnnxEmbedder._get_model()
(packages/memtomem/src/memtomem/embedding/onnx.py:72) lazily instantiates
fastembed.TextEmbedding(...) on first use. fastembed's constructor is
what triggers the HuggingFace snapshot download, and it does not
expose progress events the caller can subscribe to — it just blocks until
the files are on disk.
- A similar story holds for the cross-encoder reranker if enabled.
- The Web UI's search request goes through
pipeline.search
(packages/memtomem/src/memtomem/search/pipeline.py:357) →
embedder.embed_query → _get_model() → fastembed download. From the
UI's perspective it's a single long-running fetch with no intermediate
signal.
- /api/embedding-status exists already
(packages/memtomem/src/memtomem/web/routes/system.py:728) but is about
config-mismatch detection, not load state — it does not solve this.
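For orientation, a minimal sketch of the lazy-load pattern described above.
fastembed.TextEmbedding is the real API; the class shape is paraphrased for
illustration, not the actual memtomem source.

from fastembed import TextEmbedding

class OnnxEmbedder:
    def __init__(self, model_name: str, cache_dir: str | None = None):
        self._model_name = model_name
        self._cache_dir = cache_dir
        self._model: TextEmbedding | None = None

    def _get_model(self) -> TextEmbedding:
        if self._model is None:
            # This constructor is the blocking step: on a cold cache it
            # downloads the full HuggingFace snapshot (~2.3 GB for bge-m3)
            # and exposes no progress callback to the caller.
            self._model = TextEmbedding(
                model_name=self._model_name,
                cache_dir=self._cache_dir,
            )
        return self._model

    def embed_query(self, text: str) -> list[float]:
        model = self._get_model()  # the first call can block for minutes
        vector = next(iter(model.embed([text])))
        return vector.tolist()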
Design options (to discuss before PR)
Three options of increasing scope, smallest first:
A. Generic "warming up" hint (zero-backend)
Detect when the first search exceeds, say, 3 seconds and replace the
spinner with a translated string like "First search after launch may
take a moment — loading models…". Pure frontend; no server changes; no
download-vs-load distinction.
- Pros: ~1-day PR, no API surface, nothing to maintain.
- Cons: Still silent for the multi-minute download case; the user never
learns how long to wait or what's happening.
B. New readiness endpoint + banner (recommended starting point)
Add GET /api/system/embedding-readiness returning something like:
{
  "state": "downloading" | "loading" | "ready" | "error",
  "model": "BAAI/bge-m3",
  "approx_size_mb": 2300,
  "cache_present": true,
  "reranker": { "state": "ready", "model": "..." }
}
state derivation (sketched below):
- cache_present → check the resolved fastembed cache dir
(fastembed_cache.resolve_fastembed_cache_dir()) for the model's
snapshot directory.
- state="downloading" when the snapshot dir is missing and a download
task is in flight.
- state="loading" when the snapshot dir is present but _get_model()
hasn't returned yet.
- state="ready" after the first successful _get_model().
Wire a small banner in the SPA header that polls this endpoint while
non-ready and shows e.g. "Downloading bge-m3 (~2.3 GB). First search
will start when ready."
- Pros: Concrete user signal; survives the multi-minute download case.
No streaming or new dependencies.
- Cons: No actual progress percentage — fastembed doesn't expose it.
States are coarse-grained.
C. Streamed download progress (overkill for v0.1.x)
Wrap huggingface_hub.snapshot_download directly to capture per-file
progress and stream it via SSE. Bypasses fastembed's blocking constructor.
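The wrapping might look like the sketch below. The tqdm_class hook on
snapshot_download is real huggingface_hub API; the queue plumbing is
illustrative, and note it surfaces per-file counts, not bytes.

import queue

from huggingface_hub import snapshot_download
from tqdm.auto import tqdm

progress_events: "queue.Queue[dict]" = queue.Queue()

class ReportingTqdm(tqdm):
    # snapshot_download drives this bar over its "Fetching N files" loop,
    # so each update() marks one completed file.
    def update(self, n=1):
        displayed = super().update(n)
        progress_events.put({"done": int(self.n), "total": int(self.total or 0)})
        return displayed

def download_with_progress(repo_id: str, cache_dir: str) -> str:
    # Blocks like fastembed's constructor, but another thread can drain
    # progress_events into an SSE response. Mapping the configured model
    # name to repo_id is exactly the resolution logic duplicated from
    # fastembed that the cons below warn about.
    return snapshot_download(
        repo_id=repo_id,
        cache_dir=cache_dir,
        tqdm_class=ReportingTqdm,
    )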
- Pros: Real percent-complete bar.
- Cons: Significant code surface — duplicates fastembed's resolution
logic, has to track schema changes, and the value-add over (B) is
marginal once the user knows roughly how long to wait.
Open questions
- Should the banner also show before the first search (e.g. on
mm web boot if the cache is empty), or only when blocking a request?
Up-front warning is friendlier but adds a new always-present UI element.
- Does the reranker get its own banner or share one? They can download in
parallel and finish at different times.
- Translations — banner copy needs en + ko strings, and probably a
model-name + size template. New home.banner.embedding_* keys.
Severity
Medium UX. Not data loss; not a regression in the strict sense (the
download has always been silent). But it is the single most common "first
launch is broken" perception, and the multilingual stack — which is the
default Korean / multilingual choice in mm init — exposes it most.
Suggested next step
Pick option (B) and draft a small RFC or PR for the endpoint + banner,
with explicit answers to the open questions above.