feat(wizard): explicit guardrails for failed init steps (closes #626) by memtomem · Pull Request #661 · memtomem/memtomem

memtomem · 2026-05-01T10:27:02Z

Summary

Adds StepRetry exception and fail_step() helper to wizard.py so a failed wizard step prompts the user to retry / back / quit instead of silently advancing.
Wires the new mechanism into _step_embedding's Ollama branch — server reachability check, model verification, and ollama pull are now guarded.
_ollama_has_model now returns bool | None (tri-state) so callers distinguish "server unreachable / errored" from "model absent". The original False-on-error conflation is the root cause of the [feature]: add guardrails for failed steps in the mm init wizard #626 misread (a server timeout was being interpreted as "model not installed", triggering a doomed pull and silent fall-through to the Reranker step).
Drops the misleading "Ollama not running. Starting 'ollama serve'..." line — the wizard never actually started the server, only printed the line.

Reproduction (the #626 scenario)

[3] Ollama → bge-m3 → "model not found" → pull fails → wizard moves on as if nothing happened

After this PR the user sees:

✗ Pull failed: Error: timed out waiting for server to start
Retry, back, or quit? [R/b/q]:

r re-runs only the pull (model selection is preserved). b returns to the previous step. q cancels the wizard via the same path as the q shortcut.

Incidental scope

The OpenAI branch's "API key invalid + decline to continue" path used raise SystemExit(1); switched to WizardCancel for uniform cancel UX. Exit code on this path changes from 1 to 0 — this is a deliberate cancel by the user, not an error. No tests pinned the previous exit code.

Out of scope (follow-up)

Silent-continue audit of other steps (_step_memory_dir dir creation, _step_settings writes, etc.). The mechanism added here can be reused per step in a follow-up PR.
Automatic backoff retry on transient failures.
Actually starting ollama serve from the wizard.

Test plan

uv run ruff check + format --check packages/memtomem/{src,tests} clean
uv run pytest -m "not ollama" — 3873 passed
New unit tests pin the StepRetry / fail_step choice→exception mapping (12 cases) and the _ollama_has_model tri-state contract (5 cases)
New integration tests pin the [feature]: add guardrails for failed steps in the mm init wizard #626 invariant: _step_embedding does NOT silently advance to the next step on Ollama failure (sentinel next-step check)
Manual pty-driven smoke against mm init --advanced with a fake ollama binary, in 5 scenarios:
- A (server unreachable + q) — red ✗ Ollama server is not reachable. prompt fires, wizard cancels (exit 0). The Starting 'ollama serve'... lie is gone.
- B (server unreachable + b) — falls back to (already at first step), re-runs _step_embedding from the top. Wizard never advances past it.
- C (server unreachable + r with server still broken + q) — outer retry loop re-runs the server check only; the embedding-provider menu does NOT re-render.
- D (server up, model missing, pull fails + r + still fails + q) — inner retry loop re-runs only the pull action; the model-selection menu is NOT re-prompted.
- E (pull fails + b) — wizard returns to the embedding choice; _step_reranker (next step) never runs. This is the core [feature]: add guardrails for failed steps in the mm init wizard #626 invariant.

🤖 Generated with Claude Code

The `mm init` wizard previously continued silently after a step failed, which is how the user reported scenario looked: a broken Ollama server caused `_ollama_has_model` to misread "server timed out" as "model not installed", the wizard offered a doomed `ollama pull`, the pull also failed, and the wizard then advanced straight to the Reranker step with no acknowledgement that anything was wrong. This change adds an explicit recovery prompt at each Ollama failure point. The user must now choose retry / back / quit — the wizard never silently moves past a failed action. Mechanism (`wizard.py`): - `StepRetry` exception, caught by `run_steps` to re-invoke the same step, or by the step itself to retry only an inner action without re-prompting earlier inputs. - `fail_step(message, *, retryable=True)` helper that surfaces a red failure message and prompts for r/b/q (or just b/q when `retryable=False`). Always raises — never returns. Wiring (`init_cmd.py`): - `_ollama_has_model` now returns `bool | None` so callers can distinguish "server unreachable / errored" (`None`) from "model absent" (`False`). The original `False`-on-error conflation was the root of the misread. - `_step_embedding` Ollama branch: server-reachability check, model verification, and `ollama pull` are each guarded with `fail_step`. The action region is wrapped in nested retry loops so the outer `StepRetry` re-runs server check + model select, and the inner `StepRetry` re-runs only the pull. - The misleading "Ollama not running. Starting 'ollama serve'..." message is dropped — the wizard never actually started the server, it only printed the line. The new path tells the user to start it themselves and offers retry once it is up. Incidental cleanup: the OpenAI branch's "API key invalid + decline to continue" path used `raise SystemExit(1)`. Switched to `WizardCancel` so cancellation is uniform with `q`. Exit code changes from 1 to 0 — this is a deliberate cancel, not an error. Out of scope (follow-up): silent-continue audit of other steps (`_step_memory_dir` dir creation, `_step_settings` writes), automatic backoff retry, and starting `ollama serve` from the wizard. Co-Authored-By: Claude <[email protected]>

…CHANGELOG Addresses self-review feedback on #661: - Drop the 2-line "Was raise SystemExit(1)" comment in init_cmd.py. CLAUDE.md prohibits removed-code comments; the context already lives in the commit history and PR body. - Add ``test_invalid_key_decline_continue_raises_wizard_cancel`` and ``test_invalid_key_decline_does_not_advance_to_next_step`` to pin the OpenAI bad-key + decline → ``WizardCancel`` change. The exit code 1 → 0 shift is now load-bearing in tests, so a future revert is loud. - Add ``test_has_model_none_r_retries_then_succeeds`` symmetric to the pull-failure retry test — covers the tri-state ``None`` retry path via the outer loop, complementing the existing ``b`` branch test. - Add a CHANGELOG entry under [Unreleased] / Changed documenting the wizard guardrail and the OpenAI exit-code behavior change so users monitoring exit codes notice on upgrade. Co-Authored-By: Claude <[email protected]>

memtomem · 2026-05-01T10:55:04Z

Admin-merging despite test (macos-latest) red.

Evidence the failure is pre-existing and unrelated to this PR:

3 consecutive macOS failures, all in packages/memtomem/tests/test_embedding_progress.py (introduced in feat(indexing): per-chunk progress events for SSE indexing stream #653):
1. test_onnx_streams_progress_per_yield[asyncio] — expected 5 calls, got []
2. test_onnx_streams_progress_per_yield[asyncio] — same assertion (rerun chore: prepare for open-source release #1)
3. test_onnx_throttles_thread_hops_for_large_input[asyncio] — expected ~20 throttled ticks for n=200, got 0
The same test family failed on the e034d10 main run (the very PR that introduced these tests). The pattern is on_progress callbacks not firing on the macOS GitHub runner with Python 3.14.
This PR's diff is scoped to cli/wizard.py, cli/init_cmd.py, the corresponding tests, and CHANGELOG.md — nothing touching ONNX, async progress, or embedding internals.
Local pytest on macOS passes the entire test_embedding_progress.py file (12/12).
All other required checks green: lint, notebooks, golden-path (ONNX bge-m3), test (ubuntu-latest), typecheck. Windows is not in the required list.

The flake is upstream of this PR and should be triaged separately.

pandas-studio and others added 2 commits May 1, 2026 19:26

memtomem merged commit 7c74c49 into main May 1, 2026
19 of 25 checks passed

memtomem deleted the fix/wizard-guardrails-626 branch May 1, 2026 10:55

github-actions Bot locked and limited conversation to collaborators May 1, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(wizard): explicit guardrails for failed init steps (closes #626)#661

feat(wizard): explicit guardrails for failed init steps (closes #626)#661
memtomem merged 2 commits intomainfrom
fix/wizard-guardrails-626

memtomem commented May 1, 2026 •

edited

Loading

Uh oh!

memtomem commented May 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

memtomem commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Reproduction (the #626 scenario)

Incidental scope

Out of scope (follow-up)

Test plan

Uh oh!

memtomem commented May 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

memtomem commented May 1, 2026 •

edited

Loading