
docs: v0.33 release notes and version bump #96

Merged
nesquena-hermes merged 1 commit into master from docs/v0.33-release
Apr 5, 2026

Conversation

@nesquena-hermes
Collaborator

Post-merge docs for PR #93 + fix #95 (/insights sync feature + state_sync.py correctness fixes).

  • CHANGELOG.md: add v0.33 entry
  • ROADMAP.md: update version/date
  • SPRINTS.md: update current version
  • static/index.html: bump sidebar label to v0.33

No code changes. 424 tests pass.

@nesquena-hermes nesquena-hermes merged commit 6d4c258 into master Apr 5, 2026
@nesquena-hermes nesquena-hermes deleted the docs/v0.33-release branch April 5, 2026 03:10
roadhero added a commit to fox-in-the-box-ai/hermes-webui that referenced this pull request May 5, 2026
…squena#96) (#2)

* feat(tailscale): backend orchestration for desktop auth flow (nesquena#96 phase 1)

New api/tailscale.py module wraps the in-container `tailscale` CLI so
the desktop app can drive auth without docker-exec. Six endpoints:

  GET  /api/tailscale/status      — daemon + tailnet snapshot
  POST /api/tailscale/up          — start interactive or auth-key auth
  GET  /api/tailscale/up/poll     — current state of the in-flight attempt
  POST /api/tailscale/logout      — disconnect
  GET  /api/tailscale/serve       — current Serve config
  POST /api/tailscale/serve       — re-run `tailscale serve --bg / 8787`

Up-state machine:
  idle → starting → awaiting-auth → running   (success)
                 ↘ failed                     (timeout / non-zero rc)

The /up handler spawns `tailscale up` in a daemon thread, scrapes
stdout for the auth URL (Tailscale cloud + headscale-style URLs),
and returns immediately. The HTTP request never blocks on the user's
browser interaction. /up/poll reads the shared state and also peeks
at `tailscale status --json` so it can promote to `running` even
if the subprocess hasn't exited yet.

start_up() is idempotent: a second call while an attempt is in flight
returns the existing auth_url. Stale attempts (older than the timeout)
are reset.
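The state machine and the idempotent start_up() described above can be sketched as follows. This is a minimal illustration, not the module's actual code; `_UP_TIMEOUT` and the `_run_up` placeholder are assumptions standing in for the real subprocess work.

```python
import threading
import time

_UP_TIMEOUT = 600  # assumed timeout in seconds; the real value may differ
_up_lock = threading.Lock()
_up_state = {"state": "idle", "auth_url": None, "started_at": 0.0}

def start_up(auth_key=None):
    """Idempotent: a second call while an attempt is in flight returns
    the existing auth_url instead of spawning a second subprocess.
    Stale attempts (older than the timeout) are reset."""
    with _up_lock:
        in_flight = _up_state["state"] in ("starting", "awaiting-auth")
        stale = time.time() - _up_state["started_at"] > _UP_TIMEOUT
        if in_flight and not stale:
            return {"reused": True, "auth_url": _up_state["auth_url"]}
        # Idle, terminal, or stale: begin a fresh attempt.
        _up_state.update(state="starting", auth_url=None,
                         started_at=time.time())
    threading.Thread(target=_run_up, args=(auth_key,), daemon=True).start()
    return {"reused": False}

def _run_up(auth_key):
    # Placeholder for the real work: spawn `tailscale up`, scrape the
    # auth URL, then transition to running or failed.
    with _up_lock:
        _up_state["state"] = "awaiting-auth"
```

The HTTP handler returns immediately after spawning the thread; only /up/poll reads the shared state afterwards.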

argv supports the full Phase-2 power-user flag set already
(--login-server, --advertise-routes, --advertise-tags, --accept-routes,
--accept-dns, --exit-node) — but the UI only exposes Connect/Auth-key
in Phase 1. Phase 2 will surface them.

Part of v0.4.4. Closes nesquena#96 phase 1 backend.

* feat(tailscale): Settings → Network connection panel (nesquena#96 phase 1 frontend)

Adds a Tailscale connection tile to Settings (right above the existing
hostname tile from nesquena#44). Wires the new backend endpoints:

  - Connection status badge (Connected / Connecting / Needs login /
    Disconnected / Unknown), tailnet HTTPS URL, Connect / Disconnect
    buttons.
  - Advanced accordion exposes the auth-key field — the Phase-1 power
    user escape hatch. Phase 2 will surface the rest (login-server,
    advertise-routes, advertise-tags, accept-routes, accept-dns,
    exit-node) — argv builder already handles them server-side.

Connect flow:
  1. POST /api/tailscale/up with optional {auth_key}
  2. Polls /api/tailscale/up/poll every 2s
  3. First time an auth_url surfaces, opens it in a new tab via
     window.open. Subsequent polls don't re-open (would spam tabs)
  4. When state === 'running', stops polling, refreshes the status
     snapshot, badge flips to Connected

Disconnect flow: confirm, POST /logout, refresh.

The tile loads alongside the existing settings probes — best-effort
catch keeps Settings resilient if the new endpoints 404.

---------

Co-authored-by: Dennis <[email protected]>
roadhero added a commit to fox-in-the-box-ai/hermes-webui that referenced this pull request May 5, 2026
…l + banner) (#3)

* feat(tailscale): power-user fields in Settings advanced accordion (nesquena#96 phase 2)

Surfaces the 6 power-user Tailscale flags that nesquena#96 phase 1's argv
builder already accepted but the UI didn't expose:

  - Login server (custom control plane URL — headscale, on-prem)
  - Advertise routes (subnet-router CIDRs)
  - Advertise tags (ACL identity)
  - Exit node (route all traffic via a peer)
  - Accept routes (consume peers' subnet routes)
  - Accept DNS (MagicDNS, default on)

Closes Persona 3 of nesquena#96.

Backend:
- api/config.py: 6 new keys in _SETTINGS_DEFAULTS, the two booleans
  added to _SETTINGS_BOOL_KEYS allowlist
- api/tailscale.py: _load_persisted_opts() reads them as the
  body-shape that _build_up_argv() expects; get_status() exposes
  current values under config{}; start_up() merges body opts on
  top of persisted (body wins per-key, empty string falls through)

Frontend:
- static/index.html: expand Advanced accordion in the Tailscale
  tile with 4 text inputs, 2 checkboxes, Save button, status line
- static/panels.js: _tsPopulateAdvanced() pre-fills from
  state.config on Settings load; tsSaveAdvanced() POSTs all 6
  values to the existing /api/settings endpoint (which already
  validates types via _SETTINGS_BOOL_KEYS / allowlist)

Persistence is via settings.json — same store as everything else.
Power user can edit, Save, Connect; values survive container
restarts and pre-populate next session.

Phase 1 of v0.4.6.

* feat(fallback): reactive modal + recovery banner (#9 polish)

Two pieces of UX layered on top of v0.4.1's silent failover:

1. Reactive modal — when a chat stream fails on a remote provider AND
   the user has NOT opted into local fallback, surface a one-time
   modal offering to enable it. Fires only on error types where local
   would actually help (stream_interrupted, rate_limit, no_response,
   unknown). Skips eligibility-killers like auth_mismatch /
   model_not_found / quota_exhausted that local can't fix.

2. Recovery banner — when local fallback IS enabled, periodically poll
   /api/local-fallback/remote-health (new endpoint that probes
   openrouter.ai/api/v1/models with a 5s timeout, 30s in-process
   cache). When remote is reachable again, show a top banner offering
   to switch off local fallback.

Architecture:
- Surgical one-line dispatch in messages.js apperror handler:
  window.dispatchEvent(new CustomEvent('fitb:stream-error', {detail}))
- New fallback-polish.js IIFE listens for the event, talks to
  /api/local-fallback/{status,enable,disable,remote-health}
- sessionStorage flags for "modal seen" / "banner dismissed" so the
  UI doesn't pester. Reset on page reload.
- New CSS in fox-in-the-box.css styles the modal + banner using
  existing CSS vars (theme-aware).

Backend:
- local_fallback.get_remote_health() — urllib.request, 5s timeout,
  30s in-process cache to keep multi-tab polling cheap. Returns
  {remote_healthy, tested_url, error}. OpenRouter is the FITB default
  provider; its public /models endpoint is the canonical "internet
  works and primary provider is up" probe.
- routes.py — wires GET /api/local-fallback/remote-health.

Phase 2 of v0.4.6.

---------

Co-authored-by: Dennis <[email protected]>
roadhero added a commit to fox-in-the-box-ai/hermes-webui that referenced this pull request May 5, 2026
…fallback + onboarding) (#4)

* fix(security): XSS in wizard onclick + Tailscale flag-injection (P0)

QA pass v0.4.7 — Wave A (P0 security).

XSS — setup.js useLocalOllama inline onclick (P0)
  Model names from a remote/host Ollama daemon were interpolated into
  an `onclick="useLocalOllama('...')"` attribute via escapeHtml(),
  which is HTML-attribute-safe but does NOT safely escape JS string
  literals. A backslash + apostrophe in a model name could break out
  of the JS string and execute arbitrary code in the wizard. Fix:
  use a data-action / data-model attribute pair and a single delegated
  click listener instead of building inline onclick strings.

Tailscale flag injection — _build_up_argv argv unchecked (P0)
  Power-user strings (login_server, advertise_routes, advertise_tags,
  exit_node) flowed straight into `tailscale up --flag=value` argv
  with no validation. shell=False prevented classic shell injection
  but a malicious settings POST could still:
    - prefix a value with `-` to inject another flag (smuggling
      --auth-key= via a routes value)
    - point --login-server at attacker-controlled headscale to
      redirect the auth flow
    - smuggle newlines / control chars
  Fix: add `_validate_ts_opt()` with per-field regex (URL, CIDR list,
  tag:* regex, hostname RFC 1035, host/IP charset). Validators are
  enforced in TWO places — _build_up_argv (argv-time gate) and a new
  validate_settings_dict() called from config.save_settings (POST-
  time gate). Defense in depth.

Bool coercion — bool("false") is True (P1 footgun)
  config.save_settings used `bool(v)` for _SETTINGS_BOOL_KEYS. Any
  non-empty string coerced to True — a curl POST with
  {"hostname_prompted":"false"} silently flipped the flag to True,
  locking users out of the post-wizard prompt. Fix: explicit
  string-to-bool coercion accepting "true"/"1"/"yes"/"on" and
  "false"/"0"/"no"/"off"/"", rejecting other types (key skipped).
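The coercion rule above, as a standalone sketch (helper name is illustrative):

```python
_TRUE = {"true", "1", "yes", "on"}
_FALSE = {"false", "0", "no", "off", ""}

def coerce_bool(value):
    """Strict string-to-bool. Returns None for anything that should be
    skipped rather than guessed -- unlike bool(v), which maps every
    non-empty string (including "false") to True."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        v = value.strip().lower()
        if v in _TRUE:
            return True
        if v in _FALSE:
            return False
    return None  # unrecognized: caller skips the key
```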

Wizard race — Next button enabled before probes return (P2)
  setup.js renders Step 1 immediately then awaits parallel probes for
  Ollama / local-fallback / welcome text. A fast user clicking Next
  before probes resolved missed the local-model fast-paths. Fix:
  render Next as disabled with a "Detecting local options…" spinner
  while state.ollama and state.localFallback are both null; re-render
  enables it once probes settle.

Smoke-tested: argv-time validators reject leading dash, newlines,
shell-metachar URLs, oversized values; accept well-formed CIDRs,
tag:* strings, normal Tailscale URLs.

* fix(tailscale): concurrency, lifecycle, hostname persist, logout serve cleanup (P1)

QA pass v0.4.7 — Wave B (Tailscale state machine).

Six related P1 bugs from the QA review, all in api/tailscale.py:

1. _up_proc / _up_log written outside _up_lock
   start_up() read _up_proc.poll() under the lock but the daemon thread
   wrote _up_proc = subprocess.Popen(...) and _up_proc = None outside
   the lock. A user clicking Connect during the daemon's brief teardown
   window observed a dead Popen, decided "in flight", and returned
   {reused: True} — silently dropping the click. The new attempt
   never started; UI sat on "Starting…" forever.

2. start_up() from terminal failed/running didn't reset _up_proc
   The "starting/awaiting-auth" guard fell through but _up_proc stayed
   as the previous (dying) handle. The new spawn raced with the dying
   thread's _set_up_state — new attempt's `starting` got flipped back
   to `failed` by the previous thread the moment its wait() returned.

3. _up_subprocess used global _up_proc mid-loop
   `_up_proc.kill()`, `.stdout.readline()`, `.wait()` all dereferenced
   the GLOBAL each iteration. A second start_up() that reassigned
   _up_proc made the first thread's wait() block on the NEW
   subprocess, then write its return code over the new attempt.
   A classic capture-by-reference-via-globals bug.

4. logout() didn't kill in-flight _up_proc
   logout() reset _up_state to idle but left the daemon thread alive.
   If the user disconnected while in awaiting-auth, the orphaned
   subprocess could later complete (or hit its 600s timeout) and its
   terminal handler flipped state to running/failed over the freshly-
   reset idle. Badge said "Connected" minutes after explicit disconnect.

5. Hostname dropped from _load_persisted_opts
   The persisted-opts loader returned the six power-user fields but
   not hostname. _build_up_argv only got hostname when the body
   explicitly carried one (Connect path) — but Reconnect after logout
   passes an empty body, so the saved FOX_HOSTNAME was silently
   replaced with whatever Tailscale's default-naming picked
   (typically the container ID).

6. logout() didn't clean Serve config; Reconnect didn't auto-Serve
   `tailscale logout` left the Serve binding pointing at a
   now-disconnected tunnel. After a fresh Connect under a new tailnet
   identity, Serve was missing until the user manually poked
   /api/tailscale/serve. The auto-config in entrypoint.sh only fires
   at container boot, not on Reconnect.

Plus QA known-bug #1 (subprocess readline blocking past deadline).

Fix: introduce attempt-id on _up_state. Each daemon thread captures
its attempt_id on spawn; _set_up_state takes attempt_id and silently
drops stale-thread updates. _up_subprocess uses a LOCAL `proc`
variable for all subprocess ops (never re-reads the global). All
shared-state mutations now go through _up_lock. start_up() actively
kills the previous proc on retry-after-terminal. logout() kills
in-flight, bumps attempt_id, resets Serve via `serve reset` (with
`serve --remove /` fallback for older builds). _load_persisted_opts
now reads hostname from /data/config/hermes.env (the nesquena#44 helper).
After detecting BackendState=Running, the daemon auto-calls
configure_serve() so Reconnect publishes HTTPS without manual action.

readline() blocking fix: select() with 1s timeout wraps stdout
reads; the deadline check fires every loop iteration regardless of
whether the subprocess produces output.
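A sketch of that select()-gated read loop, assuming a POSIX pipe (function and callback names are illustrative):

```python
import select
import time

def drain_with_deadline(proc, deadline, on_line):
    """Read a subprocess's stdout without letting readline() block past
    the deadline: select() with a 1s timeout gates each read, so the
    deadline check runs every iteration even when no output arrives."""
    while time.time() < deadline:
        ready, _, _ = select.select([proc.stdout], [], [], 1.0)
        if ready:
            line = proc.stdout.readline()
            if not line:  # EOF: subprocess closed its stdout
                break
            on_line(line)
        elif proc.poll() is not None:
            break  # exited with nothing left to drain
```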

Smoke-tested: imports clean; _load_persisted_opts now contains
hostname key.

* fix(backend): use_model secret cleanup, download-failure recovery, multi-provider probe (P1)

QA pass v0.4.7 — Wave C (remaining backend P1s).

1. ollama.use_model leaked stale provider keys
   The function popped only `api_key` from the existing model_cfg,
   leaving azure_endpoint, azure_api_version, aws_region,
   aws_access_key_id, aws_secret_access_key, vertex_project,
   openai_organization, custom headers, etc. in place when switching
   to local Ollama. At minimum a stale-secrets exposure in a config
   file the user thought they had switched away from; at worst those
   keys riding along on Ollama requests if the OpenAI-compat client
   forwarded them. Fix: replace the model block wholesale —
   {provider:"custom", base_url, name} only.

2. local_fallback._start_when_ready ignored download failures
   The 10-minute watcher polled _is_final_present() every 1s but
   never checked the download job's actual state. If the job moved
   to `failed` or `cancelled`, the watcher silently slept until
   deadline; the user sat on the wizard's 30-minute polling clock
   seeing "Downloading X%" for an extra 20 minutes after the actual
   failure. Fix: also poll list_models() inside the watcher; if
   status is failed/cancelled, log + bail. The wizard's progress UI
   already surfaces the download-manager's real state via /status,
   so the user gets a real error instead of a fake-still-downloading.
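The corrected watcher loop can be sketched generically; the callback names here are illustrative stand-ins for `_is_final_present()` and the list_models()-based status check:

```python
import time

def watch_download(is_final_present, get_job_status,
                   deadline_s=600, poll_s=1):
    """Poll for the final artifact, but also bail as soon as the
    download job itself reports failed/cancelled instead of
    sleeping until the deadline."""
    deadline = time.time() + deadline_s
    while time.time() < deadline:
        if is_final_present():
            return "ready"
        status = get_job_status()
        if status in ("failed", "cancelled"):
            return status  # surface the real error immediately
        time.sleep(poll_s)
    return "timeout"
```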

3. Recovery banner OpenRouter-hardcoded probe was misleading
   Previously /api/local-fallback/remote-health probed
   openrouter.ai/api/v1/models exclusively, which lied to Anthropic /
   OpenAI direct / custom-provider users — they could see "remote is
   back" when their actual provider was still down, switch off local
   fallback, and have their next chat fail. Fix: probe multiple
   provider hosts in order (OpenRouter, OpenAI, Anthropic). Declare
   healthy if ANY responds, including 401/403 — those mean the
   network path works (auth failure is what a real chat surfaces).
   Still has the 30s cache so multi-tab polling stays cheap.

Smoke-tested: imports clean.

* fix(frontend): modal/banner state, hostname-prompt gating, tile auto-refresh, dispatch correctness, indeterminate progress bar (P1/P2)

QA pass v0.4.7 — Wave D (frontend polish bugs).

1. Reactive modal: MODAL_DISMISSED flag set on entry, not on dismiss
   Previously sessionStorage MODAL_DISMISSED was set inside
   showReactiveModal() before the modal even rendered. If the user
   never clicked anything (focus elsewhere, modal stacked under
   another), or if `enable` failed, the modal was still locked-out
   for the rest of the session. Fix: introduce closeAndDismiss(mark)
   helper that sets the flag only on Dismiss / Escape / successful
   Enable. Failed Enable re-enables the buttons and leaves the modal
   open. Adds a tighter _modalOpen guard against rapid double-fire,
   and Escape-to-dismiss for accessibility.

2. Recovery banner kept polling forever after shown
   recoveryTick scheduled the next 90s tick unconditionally; once the
   banner was visible, that meant a 90s heartbeat to the probe URLs
   (OpenRouter / OpenAI / Anthropic) for the lifetime of the open tab.
   Fix: banner-visible check at top of recoveryTick — if _bannerNode
   is set, stopRecoveryPolling and bail. Polling resumes only when
   the banner is dismissed or switched away from.

3. apperror dispatch was inside JSON.parse try-block
   A malformed apperror payload (truncated stream, non-JSON frame)
   skipped the fitb:stream-error dispatch entirely — meaning the
   reactive modal missed exactly the kinds of transient failures
   it's most useful for. Fix: dispatch always fires; falls back
   to {type:'unknown'} if the inner JSON.parse fails.

4. Hostname prompt fired pre-Running
   _read_effective_hostname returns Self.HostName which is populated
   well before BackendState reaches Running. The post-wizard prompt
   fired during NeedsLogin / NeedsMachineAuth state — users dismissed
   it ("why are you asking? I haven't connected yet"), and Skip
   persists prompted=true so they never got the prompt again after
   they actually joined the tailnet. Fix: hostname.py exposes
   backend_state in the GET response; hostname-prompt.js gates on
   backend_state === 'Running'.

5. Settings → Network tile didn't auto-refresh
   Tile loaded once on Settings tab open. If the tunnel dropped
   while the user was looking at Settings, the badge stayed stale.
   Fix: 15s lightweight status-refresh poller while the tile is in
   the DOM and the tab is visible. Defers to the active connect-flow
   poller (2s) when one is running.

6. Saved Tailscale advanced settings didn't auto-apply
   tsSaveAdvanced() said "Will apply on next Connect" and left the
   user to manually reconnect. Fix: after a successful save, if
   Tailscale is currently Running, prompt to reconnect inline; on
   confirm, logout + up under the new persisted settings without
   the user leaving Settings.

7. Save error messages were generic HTTP codes
   The new validators (Wave A) reject malformed values with reason
   strings like "advertise_routes: '10.0.0/24' is not a CIDR".
   Previously the UI just showed "Save failed (HTTP 400)". Fix: parse
   the error JSON and surface its `error` / `message` field.

8. Indeterminate progress bar when bytes_total unknown
   In setup.js's useLlamaCppFallback progress UI, an unknown
   bytes_total (download manager hadn't recorded size yet, or the
   mirror didn't expose Content-Length) rendered the bar as a
   0%-filled track that looked frozen. Fix: animated indeterminate
   stripe variant when total is 0, with new CSS keyframes.

9. completSetup → completeSetup
   Cosmetic — typo. Future grep-for-completeSetup wouldn't find it.

Smoke-tested: all 5 JS files pass `node --check`; Python imports clean.

* fix(qa): regressions caught by adversarial review of v0.4.7 fixes (Wave F)

QA pass v0.4.7 — Wave F (regressions in earlier waves of this PR).

Two P1 regressions in Waves B–D and three P2 hardenings:

P1: get_up_progress() can resurrect "running" after explicit logout
  Wave B's _set_up_state(attempt_id, ...) silently drops stale-thread
  updates when attempt_id mismatches — but get_up_progress() called
  _set_up_state(state="running", ...) WITHOUT an attempt_id argument.
  With None as attempt_id, the guard fell through and applied the
  update unconditionally. A poll firing right after logout() bumped
  attempt_id to N+1 could write state="running" over the freshly-
  cleared idle. Badge said "Connected" minutes after the user
  explicitly disconnected — exactly the regression Wave B was meant
  to eliminate. Fix: pass snap.get("attempt_id") so a stale poll
  cannot stomp logout's reset.

P1: tsSaveAdvanced validator-error UX never reached the user
  Wave A added a Tailscale validator gate inside save_settings that
  raises ValueError("Tailscale setting rejected: ..."). Wave D added
  frontend code to surface the validator's error message inline. But
  the call site in routes.py:3163 wasn't wrapped in try/except — the
  ValueError propagated to do_POST's catch-all and the user saw
  "Internal server error" (HTTP 500). Fix: wrap save_settings() in
  try/except ValueError → return bad(handler, str(e), 400) so the
  validator's specific reason ("advertise_routes: '10.0.0/24' is
  not a CIDR") reaches the UI.
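The shape of that fix, sketched as a plain function (the real code lives inline in routes.py's handler; names here are illustrative):

```python
def handle_settings_post(save_settings, body):
    """Wrap save_settings so a validator ValueError becomes a 400 with
    the specific reason, instead of propagating to the catch-all
    and surfacing as a generic 500."""
    try:
        save_settings(body)
    except ValueError as e:
        return 400, {"error": str(e)}
    return 200, {"ok": True}
```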

P2: tailscale serve --remove / is invalid CLI syntax
  Wave B's logout serve cleanup chain was ["serve","reset"] then
  ["serve","--remove","/"]. The second has never been valid Tailscale
  syntax — `serve` doesn't accept --remove. The legacy syntax is
  `tailscale serve / off`. On older builds the fallback was dead
  code and the Serve binding persisted. Fix: replace fallback with
  ["serve","/","off"].

P2: configure_serve() ran outside the lock, no attempt_id check
  Wave B's auto-Serve-on-Running ran configure_serve() outside
  _up_lock without checking the attempt was still current. A logout
  firing between the rc=0 detection and configure_serve() produced
  Serve→logged-out tunnel. Fix: re-check attempt_id under the lock
  before calling configure_serve.

P2: _TS_HOST_RE allowed `/` in exit_node
  Copy-paste from URL regex. tailscale up --exit-node never accepts
  values containing `/`. Fix: tighten to ^[a-zA-Z0-9.\\-_:]+$ —
  still accepts IPv4, IPv6 (`::1`), and short hostnames.

P2: Modal Escape listener leak across non-Escape dismiss paths
  onKey listener was only removed inside its own handler when Escape
  fired. Dismiss/Enable paths left it attached with a closed-over
  `wrap` node. Fix: closeAndDismiss helper removes the listener
  unconditionally; onKey is hoisted above closeAndDismiss so the
  helper can reference it.

P2: _remote_health_cache shared across threads without a lock
  Recovery-banner polls from multiple tabs raced. Fix: threading.Lock
  around cache reads and writes (cheap; cache only mutates every 30s
  in steady state).

Smoke-tested: imports clean; exit_node regex rejects `foo/bar`,
accepts `::1`; logout uses `/off`; cache lock present.

* fix(tailscale): operator delegation + status-JSON auth URL scrape (Wave G — desktop flow was broken)

QA pass v0.4.7 — Wave G. End-to-end testing in a real container surfaced
two showstopper bugs in v0.4.4's nesquena#96 Phase 1 desktop Tailscale flow.
The flow has been shipping non-functional in v0.4.4–v0.4.6: the user
clicks Connect, sees "Starting…", and watches it eventually fail or
sit forever. Two related fixes:

1. webui couldn't talk to tailscaled: "Access denied: login access denied"
   tailscaled runs as root (for NET_ADMIN); the webui process runs as
   `foxinthebox` per supervisord.conf. Without `tailscale set
   --operator=foxinthebox`, every CLI call from webui errors with
   "Access denied" — `tailscaled` won't take user-mode CLI commands
   from a non-root, non-operator user. Fix lives in entrypoint.sh
   (separate commit on the main repo) — set --operator after the
   daemon is reachable. Plus: every `tailscale up` invocation from
   webui must pass --operator=foxinthebox explicitly because
   Tailscale requires non-default flags to be re-stated on every
   `up` invocation. Without this, foxinthebox-issued `up` errors
   with: "changing settings via 'tailscale up' requires mentioning
   all non-default flags."

2. Auth URL never reached the UI — Python pipe vs Tailscale block buffer
   The previous implementation read `tailscale up`'s stdout via
   subprocess.PIPE expecting line-by-line output. But Tailscale
   block-buffers stdout when not attached to a TTY — the auth URL
   sits in tailscale's internal buffer until the subprocess exits.
   Verified empirically: subprocess pid 167 was running, daemon
   reported AuthURL in `tailscale status --json`, but Python's
   readline() never returned. UI sat on "Starting…" forever.
   Fix: scrape the AuthURL from `tailscale status --json` instead.
   The daemon has the URL the moment `up` issues a fresh login;
   we don't need to read the subprocess's stdout at all. Subprocess
   stays alive (waiting on the user's browser auth) and we promote
   to "running" as soon as `BackendState == Running`, then SIGKILL
   the subprocess. Stdout is still drained best-effort with a
   non-blocking select() poll for diagnostic context on failures.
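A sketch of the daemon-side scrape. `AuthURL` and `BackendState` are top-level fields of `tailscale status --json` output; the parsing is split out so it can be exercised without a running daemon, and the function names are illustrative:

```python
import json
import subprocess

def parse_status_json(raw):
    """Pull AuthURL and BackendState out of `tailscale status --json`
    output. Returns (None, None) on malformed JSON."""
    try:
        status = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None, None
    return status.get("AuthURL") or None, status.get("BackendState")

def scrape_auth_url():
    """Ask the daemon directly instead of reading the `tailscale up`
    subprocess's block-buffered stdout."""
    out = subprocess.run(
        ["tailscale", "status", "--json"],
        capture_output=True, text=True, timeout=5,
    ).stdout
    return parse_status_json(out)
```

Polling this from /up/poll means the auth URL is available the moment the daemon issues a fresh login, independent of whether the subprocess ever flushes its stdout.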

End-to-end test (in container against real Tailscale daemon):
  - POST /api/tailscale/up returns starting
  - Within 3s, /up/poll returns awaiting-auth + valid auth_url
    (e.g. https://login.tailscale.com/a/19107e8501687c)
  - POST /logout transitions to idle, attempt_id bumped
  - Immediate re-Connect starts fresh (reused=false), works cleanly
  - No orphan `tailscale up` subprocesses after logout

---------

Co-authored-by: Dennis <[email protected]>