Skip to content

feat(wizard): explicit guardrails for failed init steps (closes #626)#661

Merged
memtomem merged 2 commits intomainfrom
fix/wizard-guardrails-626
May 1, 2026
Merged

feat(wizard): explicit guardrails for failed init steps (closes #626)#661
memtomem merged 2 commits intomainfrom
fix/wizard-guardrails-626

Conversation

@memtomem
Copy link
Copy Markdown
Owner

@memtomem memtomem commented May 1, 2026

Summary

  • Adds StepRetry exception and fail_step() helper to wizard.py so a failed wizard step prompts the user to retry / back / quit instead of silently advancing.
  • Wires the new mechanism into _step_embedding's Ollama branch — server reachability check, model verification, and ollama pull are now guarded.
  • _ollama_has_model now returns bool | None (tri-state) so callers distinguish "server unreachable / errored" from "model absent". The original False-on-error conflation is the root cause of the [feature]: add guardrails for failed steps in the mm init wizard #626 misread (a server timeout was being interpreted as "model not installed", triggering a doomed pull and silent fall-through to the Reranker step).
  • Drops the misleading "Ollama not running. Starting 'ollama serve'..." line — the wizard never actually started the server, only printed the line.

Reproduction (the #626 scenario)

[3] Ollama → bge-m3 → "model not found" → pull fails → wizard moves on as if nothing happened

After this PR the user sees:

✗ Pull failed: Error: timed out waiting for server to start
Retry, back, or quit? [R/b/q]:

r re-runs only the pull (model selection is preserved). b returns to the previous step. q cancels the wizard via the same path as the q shortcut.

Incidental scope

The OpenAI branch's "API key invalid + decline to continue" path used raise SystemExit(1); switched to WizardCancel for uniform cancel UX. Exit code on this path changes from 1 to 0 — this is a deliberate cancel by the user, not an error. No tests pinned the previous exit code.

Out of scope (follow-up)

  • Silent-continue audit of other steps (_step_memory_dir dir creation, _step_settings writes, etc.). The mechanism added here can be reused per step in a follow-up PR.
  • Automatic backoff retry on transient failures.
  • Actually starting ollama serve from the wizard.

Test plan

  • uv run ruff check + format --check packages/memtomem/{src,tests} clean
  • uv run pytest -m "not ollama" — 3873 passed
  • New unit tests pin the StepRetry / fail_step choice→exception mapping (12 cases) and the _ollama_has_model tri-state contract (5 cases)
  • New integration tests pin the [feature]: add guardrails for failed steps in the mm init wizard #626 invariant: _step_embedding does NOT silently advance to the next step on Ollama failure (sentinel next-step check)
  • Manual pty-driven smoke against mm init --advanced with a fake ollama binary, in 5 scenarios:
    • A (server unreachable + q) — red ✗ Ollama server is not reachable. prompt fires, wizard cancels (exit 0). The Starting 'ollama serve'... lie is gone.
    • B (server unreachable + b) — falls back to (already at first step), re-runs _step_embedding from the top. Wizard never advances past it.
    • C (server unreachable + r with server still broken + q) — outer retry loop re-runs the server check only; the embedding-provider menu does NOT re-render.
    • D (server up, model missing, pull fails + r + still fails + q) — inner retry loop re-runs only the pull action; the model-selection menu is NOT re-prompted.
    • E (pull fails + b) — wizard returns to the embedding choice; _step_reranker (next step) never runs. This is the core [feature]: add guardrails for failed steps in the mm init wizard #626 invariant.

🤖 Generated with Claude Code

pandas-studio and others added 2 commits May 1, 2026 19:26
The `mm init` wizard previously continued silently after a step failed,
which is how the user reported scenario looked: a broken Ollama server
caused `_ollama_has_model` to misread "server timed out" as "model not
installed", the wizard offered a doomed `ollama pull`, the pull also
failed, and the wizard then advanced straight to the Reranker step
with no acknowledgement that anything was wrong.

This change adds an explicit recovery prompt at each Ollama failure
point. The user must now choose retry / back / quit — the wizard never
silently moves past a failed action.

Mechanism (`wizard.py`):
- `StepRetry` exception, caught by `run_steps` to re-invoke the same
  step, or by the step itself to retry only an inner action without
  re-prompting earlier inputs.
- `fail_step(message, *, retryable=True)` helper that surfaces a red
  failure message and prompts for r/b/q (or just b/q when
  `retryable=False`). Always raises — never returns.

Wiring (`init_cmd.py`):
- `_ollama_has_model` now returns `bool | None` so callers can
  distinguish "server unreachable / errored" (`None`) from "model
  absent" (`False`). The original `False`-on-error conflation was the
  root of the misread.
- `_step_embedding` Ollama branch: server-reachability check, model
  verification, and `ollama pull` are each guarded with `fail_step`.
  The action region is wrapped in nested retry loops so the outer
  `StepRetry` re-runs server check + model select, and the inner
  `StepRetry` re-runs only the pull.
- The misleading "Ollama not running. Starting 'ollama serve'..."
  message is dropped — the wizard never actually started the server,
  it only printed the line. The new path tells the user to start it
  themselves and offers retry once it is up.

Incidental cleanup: the OpenAI branch's "API key invalid + decline to
continue" path used `raise SystemExit(1)`. Switched to `WizardCancel`
so cancellation is uniform with `q`. Exit code changes from 1 to 0 —
this is a deliberate cancel, not an error.

Out of scope (follow-up): silent-continue audit of other steps
(`_step_memory_dir` dir creation, `_step_settings` writes), automatic
backoff retry, and starting `ollama serve` from the wizard.

Co-Authored-By: Claude <[email protected]>
…CHANGELOG

Addresses self-review feedback on #661:

- Drop the 2-line "Was raise SystemExit(1)" comment in init_cmd.py.
  CLAUDE.md prohibits removed-code comments; the context already lives
  in the commit history and PR body.
- Add ``test_invalid_key_decline_continue_raises_wizard_cancel`` and
  ``test_invalid_key_decline_does_not_advance_to_next_step`` to pin
  the OpenAI bad-key + decline → ``WizardCancel`` change. The exit
  code 1 → 0 shift is now load-bearing in tests, so a future revert
  is loud.
- Add ``test_has_model_none_r_retries_then_succeeds`` symmetric to the
  pull-failure retry test — covers the tri-state ``None`` retry path
  via the outer loop, complementing the existing ``b`` branch test.
- Add a CHANGELOG entry under [Unreleased] / Changed documenting the
  wizard guardrail and the OpenAI exit-code behavior change so users
  monitoring exit codes notice on upgrade.

Co-Authored-By: Claude <[email protected]>
@memtomem
Copy link
Copy Markdown
Owner Author

memtomem commented May 1, 2026

Admin-merging despite test (macos-latest) red.

Evidence the failure is pre-existing and unrelated to this PR:

  • 3 consecutive macOS failures, all in packages/memtomem/tests/test_embedding_progress.py (introduced in feat(indexing): per-chunk progress events for SSE indexing stream #653):
    1. test_onnx_streams_progress_per_yield[asyncio]expected 5 calls, got []
    2. test_onnx_streams_progress_per_yield[asyncio] — same assertion (rerun chore: prepare for open-source release #1)
    3. test_onnx_throttles_thread_hops_for_large_input[asyncio]expected ~20 throttled ticks for n=200, got 0
  • The same test family failed on the e034d10 main run (the very PR that introduced these tests). The pattern is on_progress callbacks not firing on the macOS GitHub runner with Python 3.14.
  • This PR's diff is scoped to cli/wizard.py, cli/init_cmd.py, the corresponding tests, and CHANGELOG.md — nothing touching ONNX, async progress, or embedding internals.
  • Local pytest on macOS passes the entire test_embedding_progress.py file (12/12).
  • All other required checks green: lint, notebooks, golden-path (ONNX bge-m3), test (ubuntu-latest), typecheck. Windows is not in the required list.

The flake is upstream of this PR and should be triaged separately.

@memtomem memtomem merged commit 7c74c49 into main May 1, 2026
19 of 25 checks passed
@memtomem memtomem deleted the fix/wizard-guardrails-626 branch May 1, 2026 10:55
@github-actions github-actions Bot locked and limited conversation to collaborators May 1, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants