
First search is silent for 30s+ while embedding/reranker models download — surface progress #696

@memtomem


What

The first search after mm web boot can take 30s to several minutes with no
UI feedback while fastembed downloads and loads the configured ONNX model
(and reranker, if enabled). The Search button shows the standard spinner;
the user can't tell whether the request is hung, the model is loading from
cache, or a multi-GB download is in progress.

The issue is most painful for users who selected the multilingual stack at
mm init time:

  • BAAI/bge-m3 — ~2.3 GB ONNX bundle.
  • jinaai/jina-reranker-v2-base-multilingual — ~1.1 GB.

These are explicit choices in the mm init wizard
(packages/memtomem/src/memtomem/cli/init_cmd.py:441,615), but the
download then happens silently on the first search — minutes after the
choice was made — so users often think the UI is broken and reload or kill
the server.

Where it bites

  • OnnxEmbedder._get_model()
    (packages/memtomem/src/memtomem/embedding/onnx.py:72) lazily instantiates
    fastembed.TextEmbedding(...) on first use. fastembed's constructor is
    what triggers the HuggingFace snapshot download, and it does not
    expose progress events the caller can subscribe to — it just blocks until
    the files are on disk.
  • A similar story holds for the cross-encoder reranker if enabled.
  • The Web UI's search request goes through pipeline.search
    (packages/memtomem/src/memtomem/search/pipeline.py:357) →
    embedder.embed_query() → _get_model() → fastembed download. From the
    UI's perspective it's a single long-running fetch with no intermediate
    signal.
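
The blocking pattern above can be sketched as a generic lazy loader (names here are illustrative, not the real `OnnxEmbedder` internals; the factory stands in for the `fastembed.TextEmbedding(...)` constructor):

```python
import threading
from typing import Callable, Optional


class LazyModel:
    """Sketch of the lazy-load shape in OnnxEmbedder._get_model(): the
    first caller pays the full download-plus-load cost inside factory(),
    and there is no hook to observe progress from the outside."""

    def __init__(self, factory: Callable[[], object]):
        # In the real code the factory would be something like
        # lambda: fastembed.TextEmbedding(model_name="BAAI/bge-m3"),
        # which blocks until the HuggingFace snapshot is on disk.
        self._factory = factory
        self._model: Optional[object] = None
        self._lock = threading.Lock()

    def get(self) -> object:
        with self._lock:
            if self._model is None:
                self._model = self._factory()  # 30s-to-minutes on a cold cache
            return self._model
```

Any readiness endpoint (option B) would have to observe this object from the side: `self._model is not None` maps to `state="ready"`.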

/api/embedding-status exists already
(packages/memtomem/src/memtomem/web/routes/system.py:728) but is about
config-mismatch detection, not load state — it does not solve this.

Design options (to discuss before PR)

Three deliberate cuts, smallest first:

A. Generic "warming up" hint (zero-backend)

Detect when the first search exceeds, say, 3 seconds and replace the
spinner with a translated string like "First search after launch may
take a moment — loading models…". Pure frontend; no server changes; no
download-vs-load distinction.

  • Pros: ~1-day PR, no API surface, nothing to maintain.
  • Cons: Still silent for the multi-minute download case. User does not
    learn how long to wait or what's happening.

B. New readiness endpoint + banner (recommended starting point)

Add GET /api/system/embedding-readiness returning something like:

{
  "state": "downloading" | "loading" | "ready" | "error",
  "model": "BAAI/bge-m3",
  "approx_size_mb": 2300,
  "cache_present": true,
  "reranker": { "state": "ready", "model": "..." }
}
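
Server-side, that payload could be modeled with plain dataclasses (a sketch; only the field names come from the proposal above, the types and helper names are assumptions):

```python
from dataclasses import dataclass, asdict
from typing import Literal, Optional

State = Literal["downloading", "loading", "ready", "error"]


@dataclass
class RerankerReadiness:
    state: State
    model: str


@dataclass
class EmbeddingReadiness:
    state: State
    model: str
    approx_size_mb: int
    cache_present: bool
    reranker: Optional[RerankerReadiness] = None

    def to_json(self) -> dict:
        # asdict() recurses into the nested reranker dataclass, matching
        # the JSON shape sketched above.
        return asdict(self)
```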

state derivation:

  • cache_present → check the resolved fastembed cache dir
    (fastembed_cache.resolve_fastembed_cache_dir()) for the model's
    snapshot directory.
  • state="downloading" when the snapshot dir is missing and a download
    task is in flight.
  • state="loading" when the snapshot dir is present but _get_model()
    hasn't returned yet.
  • state="ready" after the first successful _get_model().
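
The derivation rules above reduce to a small pure function (a sketch; in practice the flags would come from the endpoint's view of the cache dir, any in-flight download task, and the loader, not from these hypothetical parameters):

```python
def derive_state(cache_present: bool, download_in_flight: bool,
                 model_ready: bool, load_failed: bool = False) -> str:
    """Map the observable facts to the four proposed readiness states."""
    if load_failed:
        return "error"
    if model_ready:
        return "ready"
    if not cache_present and download_in_flight:
        return "downloading"
    # Snapshot dir present but _get_model() has not returned yet; this
    # branch also catches the empty-cache/no-download-yet case, which the
    # proposal leaves open.
    return "loading"
```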

Wire a small banner in the SPA header that polls this endpoint while
non-ready and shows e.g. "Downloading bge-m3 (~2.3 GB). First search
will start when ready."

  • Pros: Concrete user signal; survives the multi-minute download case.
    No streaming or new dependencies.
  • Cons: No actual progress percentage — fastembed doesn't expose it.
    States are coarse-grained.

C. Streamed download progress (overkill for v0.1.x)

Wrap huggingface_hub.snapshot_download directly to capture per-file
progress and stream it via SSE. Bypasses fastembed's blocking constructor.
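
For scale, (C) would look roughly like the following: a tqdm subclass that mirrors updates into a queue an SSE handler could drain. This assumes huggingface_hub's `snapshot_download` accepts a `tqdm_class` argument (it does in recent versions, though the progress granularity it drives is worth verifying); everything else here is hypothetical.

```python
import queue

# Events an SSE route could drain and stream to the browser.
progress_events: "queue.Queue[dict]" = queue.Queue()


def make_reporting_tqdm(base_tqdm):
    """Wrap a tqdm-like class so every update() is mirrored into the queue."""
    class ReportingTqdm(base_tqdm):
        def update(self, n=1):
            progress_events.put(
                {"desc": self.desc, "done": self.n + (n or 0), "total": self.total}
            )
            return super().update(n)
    return ReportingTqdm


# Not executed here -- this would trigger the multi-GB download:
# from tqdm import tqdm
# from huggingface_hub import snapshot_download
# snapshot_download("BAAI/bge-m3", tqdm_class=make_reporting_tqdm(tqdm))
```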

  • Pros: Real percent-complete bar.
  • Cons: Significant code surface — duplicates fastembed's resolution
    logic, has to track schema changes, and the value-add over (B) is
    marginal once the user knows roughly how long to wait.

Open questions

  • Should the banner also show before the first search (e.g. on
    mm web boot if the cache is empty), or only when blocking a request?
    Up-front warning is friendlier but adds a new always-present UI element.
  • Does the reranker get its own banner or share one? They can download in
    parallel and finish at different times.
  • Translations — banner copy needs en + ko and probably a model-name +
    size template. New home.banner.embedding_* keys.

Severity

Medium UX. Not data loss; not a regression in the strict sense (the
download has always been silent). But it is the single most common "first
launch is broken" perception, and the multilingual stack — which is the
default Korean / multilingual choice in mm init — exposes it most.

Suggested next step

Pick option (B), draft a small RFC or PR for the endpoint + banner with
explicit answers to the open questions above.

Metadata

Labels: enhancement (New feature or request)