
First search is silent for 30s+ while embedding/reranker models download — surface progress #696

@memtomem


What

The first search after mm web boot can take 30s to several minutes with no
UI feedback while fastembed downloads and loads the configured ONNX model
(and reranker, if enabled). The Search button shows the standard spinner;
the user can't tell whether the request is hung, the model is loading from
cache, or a multi-GB download is in progress.

The issue is most painful for users who selected the multilingual stack at
mm init time:

  • BAAI/bge-m3 — ~2.3 GB ONNX bundle.
  • jinaai/jina-reranker-v2-base-multilingual — ~1.1 GB.

These are explicit choices in the mm init wizard
(packages/memtomem/src/memtomem/cli/init_cmd.py:441,615), but the
download then happens silently on the first search — minutes after the
choice was made — so users often think the UI is broken and reload or kill
the server.

Where it bites

  • OnnxEmbedder._get_model()
    (packages/memtomem/src/memtomem/embedding/onnx.py:72) lazily instantiates
    fastembed.TextEmbedding(...) on first use. fastembed's constructor is
    what triggers the HuggingFace snapshot download, and it does not
    expose progress events the caller can subscribe to — it just blocks until
    the files are on disk.
  • A similar story holds for the cross-encoder reranker if enabled.
  • The Web UI's search request goes through pipeline.search
    (packages/memtomem/src/memtomem/search/pipeline.py:357) →
    embedder.embed_query() → _get_model() → fastembed download. From the
    UI's perspective it's a single long-running fetch with no intermediate
    signal.
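
The blocking pattern above can be sketched as a generic lazy loader (names here are illustrative, not the real `OnnxEmbedder` internals; the factory stands in for the `fastembed.TextEmbedding(...)` constructor):

```python
import threading
from typing import Callable, Optional


class LazyModel:
    """Sketch of the lazy-load shape in OnnxEmbedder._get_model(): the
    first caller pays the full download-plus-load cost inside factory(),
    and there is no hook to observe progress from the outside."""

    def __init__(self, factory: Callable[[], object]):
        # In the real code the factory would be something like
        # lambda: fastembed.TextEmbedding(model_name="BAAI/bge-m3"),
        # which blocks until the HuggingFace snapshot is on disk.
        self._factory = factory
        self._model: Optional[object] = None
        self._lock = threading.Lock()

    def get(self) -> object:
        with self._lock:
            if self._model is None:
                self._model = self._factory()  # 30s-to-minutes on a cold cache
            return self._model
```

Any readiness endpoint (option B) would have to observe this object from the side: `self._model is not None` maps to `state="ready"`.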

/api/embedding-status exists already
(packages/memtomem/src/memtomem/web/routes/system.py:728) but is about
config-mismatch detection, not load state — it does not solve this.

Design options (to discuss before PR)

Three deliberate cuts, smallest first:

A. Generic "warming up" hint (zero-backend)

Detect when the first search exceeds, say, 3 seconds and replace the
spinner with a translated string like "First search after launch may
take a moment — loading models…". Pure frontend; no server changes; no
download-vs-load distinction.

  • Pros: ~1-day PR, no API surface, nothing to maintain.
  • Cons: Still silent for the multi-minute download case. User does not
    learn how long to wait or what's happening.

B. New readiness endpoint + banner (recommended starting point)

Add GET /api/system/embedding-readiness returning something like:

{
  "state": "downloading" | "loading" | "ready" | "error",
  "model": "BAAI/bge-m3",
  "approx_size_mb": 2300,
  "cache_present": true,
  "reranker": { "state": "ready", "model": "..." }
}
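
Server-side, that payload could be modeled with plain dataclasses (a sketch; only the field names come from the proposal above, the types and helper names are assumptions):

```python
from dataclasses import dataclass, asdict
from typing import Literal, Optional

State = Literal["downloading", "loading", "ready", "error"]


@dataclass
class RerankerReadiness:
    state: State
    model: str


@dataclass
class EmbeddingReadiness:
    state: State
    model: str
    approx_size_mb: int
    cache_present: bool
    reranker: Optional[RerankerReadiness] = None

    def to_json(self) -> dict:
        # asdict() recurses into the nested reranker dataclass, matching
        # the JSON shape sketched above.
        return asdict(self)
```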

state derivation:

  • cache_present → check the resolved fastembed cache dir
    (fastembed_cache.resolve_fastembed_cache_dir()) for the model's
    snapshot directory.
  • state="downloading" when the snapshot dir is missing and a download
    task is in flight.
  • state="loading" when the snapshot dir is present but _get_model()
    hasn't returned yet.
  • state="ready" after the first successful _get_model().
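
The derivation rules above reduce to a small pure function (a sketch; in practice the flags would come from the endpoint's view of the cache dir, any in-flight download task, and the loader, not from these hypothetical parameters):

```python
def derive_state(cache_present: bool, download_in_flight: bool,
                 model_ready: bool, load_failed: bool = False) -> str:
    """Map the observable facts to the four proposed readiness states."""
    if load_failed:
        return "error"
    if model_ready:
        return "ready"
    if not cache_present and download_in_flight:
        return "downloading"
    # Snapshot dir present but _get_model() has not returned yet; this
    # branch also catches the empty-cache/no-download-yet case, which the
    # proposal leaves open.
    return "loading"
```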

Wire a small banner in the SPA header that polls this endpoint while
non-ready and shows e.g. "Downloading bge-m3 (~2.3 GB). First search
will start when ready."

  • Pros: Concrete user signal; survives the multi-minute download case.
    No streaming or new dependencies.
  • Cons: No actual progress percentage — fastembed doesn't expose it.
    States are coarse-grained.

C. Streamed download progress (overkill for v0.1.x)

Wrap huggingface_hub.snapshot_download directly to capture per-file
progress and stream it via SSE. Bypasses fastembed's blocking constructor.
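
For scale, (C) would look roughly like the following: a tqdm subclass that mirrors updates into a queue an SSE handler could drain. This assumes huggingface_hub's `snapshot_download` accepts a `tqdm_class` argument (it does in recent versions, though the progress granularity it drives is worth verifying); everything else here is hypothetical.

```python
import queue

# Events an SSE route could drain and stream to the browser.
progress_events: "queue.Queue[dict]" = queue.Queue()


def make_reporting_tqdm(base_tqdm):
    """Wrap a tqdm-like class so every update() is mirrored into the queue."""
    class ReportingTqdm(base_tqdm):
        def update(self, n=1):
            progress_events.put(
                {"desc": self.desc, "done": self.n + (n or 0), "total": self.total}
            )
            return super().update(n)
    return ReportingTqdm


# Not executed here -- this would trigger the multi-GB download:
# from tqdm import tqdm
# from huggingface_hub import snapshot_download
# snapshot_download("BAAI/bge-m3", tqdm_class=make_reporting_tqdm(tqdm))
```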

  • Pros: Real percent-complete bar.
  • Cons: Significant code surface — duplicates fastembed's resolution
    logic, has to track schema changes, and the value-add over (B) is
    marginal once the user knows roughly how long to wait.

Open questions

  • Should the banner also show before the first search (e.g. on
    mm web boot if the cache is empty), or only when blocking a request?
    Up-front warning is friendlier but adds a new always-present UI element.
  • Does the reranker get its own banner or share one? They can download in
    parallel and finish at different times.
  • Translations — banner copy needs en + ko and probably a model-name +
    size template. New home.banner.embedding_* keys.

Severity

Medium UX. Not data loss; not a regression in the strict sense (the
download has always been silent). But it is the single most common "first
launch is broken" perception, and the multilingual stack — which is the
default Korean / multilingual choice in mm init — exposes it most.

Suggested next step

Pick option (B), draft a small RFC or PR for the endpoint + banner with
explicit answers to the open questions above.

Metadata

Labels: enhancement (New feature or request)