Startup fail-fast when stored embedding dim=0 but provider != 'none' (legacy NoopEmbedder mismatch) #298

@memtomem

Summary

When a DB is initialized with provider=none (NoopEmbedder, BM25-only mode), the stored _memtomem_meta.embedding_dimension is 0. If the config is later switched to a real provider (onnx/ollama) without running mm embedding-reset, startup silently accepts the mismatch: the runtime embedder produces real vectors, but chunks_vec was never created (it is only created when dimension > 0, at storage/sqlite_schema.py:169). Every subsequent upsert_chunks then fails with "no such table: chunks_vec".

Proposal: startup should fail fast (or surface a prominent warning) when stored.provider != 'none' but stored.dimension == 0, instead of silently loading into a broken state.

Repro

  1. mm init with provider=none (or an install that never ran an embedding-aware init).
  2. Edit ~/.memtomem/config.json to set embedding.provider=onnx, model=bge-m3, dimension=1024.
  3. Start mm web or mm serve — no error / warning.
  4. Trigger any indexing (mm index <path> or reindex via UI). Every file fails:
    ERROR  Indexing failed for <path>: upsert_chunks failed, transaction rolled back:
           no such table: chunks_vec
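
The mechanism behind this error can be shown in isolation. The following is a minimal sketch using plain sqlite3, not the actual memtomem code: the table layouts and the helper names create_tables / upsert_chunk here are hypothetical stand-ins, and only the "vector table is created only when dimension > 0" rule is taken from the description above.

```python
import sqlite3


def create_tables(conn: sqlite3.Connection, dimension: int) -> None:
    """Hypothetical stand-in for the schema setup described in this issue."""
    conn.execute("CREATE TABLE IF NOT EXISTS chunks (id INTEGER PRIMARY KEY, text TEXT)")
    # The vector table is only created for a non-zero dimension,
    # so a DB initialized with provider=none never gets it.
    if dimension > 0:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS chunks_vec (chunk_id INTEGER, embedding BLOB)"
        )


def upsert_chunk(conn: sqlite3.Connection, chunk_id: int, text: str, embedding: bytes) -> None:
    """Write the chunk row, then its vector row (assumed layout)."""
    conn.execute("INSERT OR REPLACE INTO chunks (id, text) VALUES (?, ?)", (chunk_id, text))
    conn.execute(
        "INSERT INTO chunks_vec (chunk_id, embedding) VALUES (?, ?)", (chunk_id, embedding)
    )


conn = sqlite3.connect(":memory:")
create_tables(conn, dimension=0)  # legacy BM25-only init: no chunks_vec

try:
    # The runtime embedder now produces real vectors, but there is nowhere to store them.
    upsert_chunk(conn, 1, "hello", b"\x00" * 4)
except sqlite3.OperationalError as exc:
    error_message = str(exc)
    conn.rollback()  # discard the partially written chunk row
    print(f"upsert_chunks failed, transaction rolled back: {error_message}")
```

Nothing at startup touches chunks_vec, which is why the problem only shows up once indexing begins.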
    

Observed 2026-04-19

A smoke test of #295's initial-scan design flooded the log with ~200 identical "no such table: chunks_vec" lines before we diagnosed the root cause. The user's _memtomem_meta had dim=0, provider=onnx, model=bge-m3, with 0 chunks populated: a contradictory combination that startup accepted without complaint.

Recovery (current): mm embedding-reset --mode apply-current drops chunks_vec if it exists, recreates it with the configured dimension, updates meta. Safe when chunks=0 (no data loss); destructive otherwise.

Proposed fix direction

In storage/sqlite_schema.py create_tables (around L140-L167 where stored provider/model are validated):

  1. Add a new validation branch: if stored_provider not in (None, 'none') AND stored_dim == 0, this is a legacy mismatch — surface it to the caller (StorageBackend.initialize) as a distinct error type.
  2. Either raise on startup with a clear remediation message ("DB has legacy NoopEmbedder meta (dim=0) but provider={provider}. Run 'mm embedding-reset --mode apply-current' to resolve.") or flag it as a startup warning + set a dim0_mismatch flag consumed by the embedding-status endpoint / web banner.

Fail-fast is probably the right call — silent loading produces hundreds of log lines before the user sees anything actionable.
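
A rough sketch of what the fail-fast gate could look like. The names LegacyDimensionMismatchError and check_embedding_meta are hypothetical (the issue only asks for "a distinct error type"), and how the stored values are read from _memtomem_meta is left out; only the condition and the remediation message come from the proposal above.

```python
from typing import Optional


class LegacyDimensionMismatchError(RuntimeError):
    """Hypothetical error type: stored meta names a real provider but dimension is 0."""


def check_embedding_meta(stored_provider: Optional[str], stored_dim: int) -> None:
    """Raise if the stored meta is the legacy NoopEmbedder mismatch.

    Assumed to be called from create_tables once the stored provider and
    dimension have been read, so CLI and web share the same diagnostic.
    """
    if stored_provider not in (None, "none") and stored_dim == 0:
        raise LegacyDimensionMismatchError(
            f"DB has legacy NoopEmbedder meta (dim=0) but provider={stored_provider}. "
            "Run 'mm embedding-reset --mode apply-current' to resolve."
        )


# Usage: the legitimate BM25-only case passes, the mismatch fails at startup.
check_embedding_meta("none", 0)
try:
    check_embedding_meta("onnx", 0)
except LegacyDimensionMismatchError as exc:
    print(f"startup aborted: {exc}")
```

Subclassing RuntimeError is just one option; the point is that StorageBackend.initialize callers can catch this case separately from other schema errors.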

Scope boundaries

  • Only affects the startup code path. No schema migration needed; the fix is a gate + clear message.
  • Doesn't change mm embedding-reset behavior; it remains the recovery tool.
  • CLI and web should both emit the same diagnostic (consider moving the check into create_tables so it covers both).

Tests

  • Add a unit test that constructs a DB with _memtomem_meta = {dim:0, provider:onnx} and asserts StorageBackend.initialize() raises the new error type.
  • Add the reverse: DB with dim:0, provider:none should still initialize cleanly (that's the legitimate BM25-only case).
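
The two tests could take roughly this shape. This is a self-contained sketch, not a drop-in test: the key/value layout of _memtomem_meta, the key names, the initialize function (standing in for StorageBackend.initialize()), and the error type name are all assumptions made for illustration.

```python
import sqlite3
import unittest


class LegacyDimensionMismatchError(RuntimeError):
    """Hypothetical name for the new error type."""


def make_db(provider: str, dimension: int) -> sqlite3.Connection:
    """Build an in-memory DB with an assumed key/value _memtomem_meta layout."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE _memtomem_meta (key TEXT PRIMARY KEY, value TEXT)")
    conn.executemany(
        "INSERT INTO _memtomem_meta (key, value) VALUES (?, ?)",
        [("embedding_provider", provider), ("embedding_dimension", str(dimension))],
    )
    conn.commit()
    return conn


def initialize(conn: sqlite3.Connection) -> None:
    """Stand-in for StorageBackend.initialize(): read meta and apply the gate."""
    meta = dict(conn.execute("SELECT key, value FROM _memtomem_meta"))
    provider = meta.get("embedding_provider")
    dim = int(meta.get("embedding_dimension", "0"))
    if provider not in (None, "none") and dim == 0:
        raise LegacyDimensionMismatchError(
            f"DB has legacy NoopEmbedder meta (dim=0) but provider={provider}."
        )


class LegacyMetaGateTest(unittest.TestCase):
    def test_real_provider_with_dim_zero_raises(self):
        conn = make_db("onnx", 0)
        self.addCleanup(conn.close)
        with self.assertRaises(LegacyDimensionMismatchError):
            initialize(conn)

    def test_noop_provider_with_dim_zero_initializes(self):
        conn = make_db("none", 0)
        self.addCleanup(conn.close)
        initialize(conn)  # legitimate BM25-only case: must not raise
```

In the real suite, make_db and initialize would be replaced by whatever fixture builds a DB file and by the actual StorageBackend.initialize() call.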
