fix(ui): defer live-model-fetch race that silently overwrites session model by nesquena-hermes · Pull Request #1174 · nesquena/hermes-webui

nesquena-hermes · 2026-04-27T21:45:37Z

Summary

Fixes a race condition where loading any session with a Nous-routed (or other live-only) model silently overwrites the session's model on disk with the first static model in the dropdown — typically @nous:anthropic/claude-opus-4.6. Reported with Kimi K2.6 (moonshotai/kimi-k2) but affects any model not present in the static 4-item list.

Root cause

The bug is a timing race between two async operations:

_resolveSessionModelForDisplaySoon() schedules a setTimeout(..., 0) that clears _modelResolutionDeferred and calls syncTopbar().
syncTopbar() calls _applyModelToDropdown(sessionModel, sel). If the model is absent from the dropdown and _modelResolutionDeferred is already false, it immediately sets sel.value = first.value (e.g. Opus 4.6) and persists that wrong model via POST /api/session/update.
_fetchLiveModels() runs in parallel — it takes 100–500ms to fetch the live model list from the provider API. By the time it completes and adds Kimi to the dropdown, the session model on disk has already been corrupted.

The existing _modelResolutionDeferred flag was designed to protect against this, but it only defers until GET /api/session?resolve_model=1 returns — it doesn't account for the live model fetch still being in flight.

Fix

Two changes:

_liveModelFetchPending set — tracks which providers have a live fetch in flight. _fetchLiveModels() adds the provider to the set before the fetch and removes it in a finally block. When syncTopbar() detects a model is missing, it checks this set: if the active provider's fetch is pending, the fallback logic is skipped entirely (no sel.value override, no disk write).
Re-apply in _addLiveModelsToSelect() — after inserting newly fetched models, re-apply S.session.model if it is now resolvable. This ensures the correct model is applied even if syncTopbar() already ran and briefly showed the wrong model in the chip.

Testing

All 2634 tests pass.
Manual: configure a Nous profile with model.default: moonshotai/kimi-k2 → load any session → model chip shows Kimi throughout; no incorrect model persisted.

Closes #1169

… model (#1169) When a user has a provider-specific model configured (e.g. Kimi K2 via Nous) that isn't in the static model list, syncTopbar() fires a setTimeout callback that finds the model absent from the dropdown and persists the wrong fallback model (first static option, e.g. Opus 4.6) to disk — before _fetchLiveModels() returns. This corrupts the session on every load. Fix has two parts: 1. Track _liveModelFetchPending per provider. While a live fetch is in flight, syncTopbar()'s 'model not found' branch skips both the sel.value override and the /api/session/update persist call entirely. 2. In _addLiveModelsToSelect(), after inserting new models, re-apply S.session.model if it is now resolvable in the updated dropdown. This restores the correct model even if syncTopbar() already ran. Closes #1169

nesquena

Review — end-to-end ✅ (clean approve)

What this ships

Agent's fix for #1169 — race condition where loading any session with a model only present in the live catalog (e.g. Kimi K2 via Nous) silently overwrites the session model on disk with the first static option (often Opus 4.6). 39 +/15 - in static/ui.js only.

This is the agent-prepared version; @bergeouss filed a parallel attempt as PR #1173 with a different shape. This PR is cleaner — see comparison below.

Traced against upstream hermes-agent

Pulled fresh nousresearch/hermes-agent tarball. Pure WebUI-frontend race; _liveModelFetchPending, _addLiveModelsToSelect, _applyModelToDropdown, and syncTopbar are WebUI-internal. No agent-side coupling. The session model that gets persisted via /api/session/update is the same field the agent reads on load — so any correct model write is fine, and the bug fix is about preventing incorrect writes during the race.

End-to-end trace

`_liveModelFetchPending` flag — Set keyed by provider

static/ui.js:168-172:

const _liveModelFetchPending=new Set();

Set in _fetchLiveModels synchronously before the first await (ui.js:241) and removed in a finally block (ui.js:257-258). Because the synchronous portion runs before the JS event loop yields, by the time populateModelDropdown returns control to its caller and syncTopbar runs, the flag is already set whenever a network fetch is in flight.

Cache-hit path returns early before add() (ui.js:233-237) — by then the dropdown already has the live models synchronously, so syncTopbar will find the model and the flag is unnecessary.

`syncTopbar` early-return — no UI flicker

static/ui.js:1849-1875:

const liveStillPending=window._activeProvider&&_liveModelFetchPending.has(window._activeProvider);
if(liveStillPending){
  // Live fetch in flight — don't touch sel.value or S.session.model yet.
} else {
  // ... existing reset-and-persist path ...
}

When live fetch is pending, the dropdown is left at its current value (which may be the default/empty option) and S.session.model is unchanged. No visible flicker — the dropdown displays the right value once _addLiveModelsToSelect runs.

This is cleaner than the alternative (set sel.value=first.value while skipping persist) because it avoids briefly showing the wrong model in the chip.

The existing if(!deferModelCorrection) literal is preserved verbatim inside the else branch — keeps the existing test_session_metadata_fast_path regression test intact without modification.

Re-apply on cache+network paths

static/ui.js:222-229 — _addLiveModelsToSelect re-applies S.session.model after adding live models, regardless of how many were added:

if(S.session && S.session.model && sel.id==='modelSelect'){
  const reapplied=_applyModelToDropdown(S.session.model, sel);
  if(reapplied && typeof syncModelChip==='function') syncModelChip();
}

sel.id==='modelSelect' filters to the chat header's primary picker — won't re-target profile-editor pickers or other dropdowns that share the helper. ✅

Static behaviour harness — 8/8 pass

Set declaration: PASS
add() before fetch(): PASS
delete() in finally: PASS
Cache-hit return before add(): PASS
syncTopbar checks .has(): PASS
liveStillPending early return: PASS
reapply S.session.model: PASS
reapply guarded by sel.id===modelSelect: PASS

Comparison vs PR #1173 (the parallel contributor attempt)

Aspect	#1173 (bergeouss)	#1174 (this PR)
Flag type	object dict `{}`	`new Set()` — slightly more idiomatic
Re-apply location	new `_reapplySessionModelIfFound()` helper	inlined into `_addLiveModelsToSelect` (DRY — works for cache and network paths)
Re-apply scope	`S.session && S.session.model`	same + `sel.id==='modelSelect'` guard (avoids re-targeting other dropdowns)
`syncTopbar` while pending	sets `sel.value=first.value` but skips persist (visible flicker)	early-return entirely (no flicker)
Persist-correction round-trip	re-posts session/update from helper	not needed — flag prevents the wrong write in the first place
Existing test	breaks `if(!deferModelCorrection)` literal assertion	preserves it

Net: this PR is the cleaner fix. Recommend closing #1173 in favour of this one once merged.

Edge-case trace

Scenario	Expected	Actual
Session model in static list (e.g. Opus 4.6)	found immediately, no race	✅ `applied=true` skips correction branch
Session model only in live catalog (Kimi K2)	dropdown waits for fetch, then re-applies	✅ liveStillPending early return + re-apply
Live fetch fails (network error)	fallback to first static, persist	✅ finally clears flag → next syncTopbar runs reset
Cache hit (cached models include the session model)	model found synchronously	✅ cache-hit returns before `add()`
Cache hit but cached models don't include session model	reset to first + persist (correct behaviour, model truly gone)	✅ flag never set, normal path
Multiple providers, one fetch pending	check uses `window._activeProvider` only	✅ scoped per active provider
`sel.id !== 'modelSelect'` (e.g. profile editor)	re-apply skipped	✅ guarded

Tests

All 5 test_session_metadata_fast_path tests: pass.
Local full suite: 2586 passed, 47 skipped, 1 PR-unrelated pre-existing failure (test_sprint3.py::test_workspace_add_rejects_system_paths — macOS-only, fails on master too).
CI on PR: ✅ test (3.11), ✅ test (3.12), ✅ test (3.13).
Static harness: 8/8 pass.

Minor observations (non-blocking)

The PR touches ui.js only and doesn't add a regression test for the race itself (just relies on the existing tests not breaking). A behavioural test for the race would be useful but harder — it requires a real DOM and async fetch mock. Acceptable for a focused two-helper fix.
_addLiveModelsToSelect's re-apply loops through S.session.model even when nothing was added. Cheap (single dropdown lookup), but a if (added > 0) gate would skip the work in the no-models-arrived-yet case. Cosmetic.

Recommendation

Approved. Tight 39-line fix that closes #1169 cleanly. Set + early-return + DRY re-apply are the right primitives. No UI flicker, no incorrect persist, no cross-tool coupling, existing test preserved verbatim. Static harness verifies all 8 invariants. Parked at approval — ready for the release agent's merge/tag pipeline.

nesquena-hermes · 2026-04-27T22:10:51Z

This is the right architectural fix for the live-model race condition. The root cause is well-identified: _modelResolutionDeferred is cleared in the same setTimeout(..., 0) callback that calls syncTopbar(), so the deferred flag provides zero protection against the live-fetch window.

What the fix gets right:

Deferring the model-correction persist until _fetchLiveModels() completes (or fails) ensures the model is never written to disk while the live dropdown is still being built.
Re-running _applyModelToDropdown() after live models are added is the clean way to give the correction logic a second chance — this is idempotent and has no effect if the model was already correctly applied.
The _liveModelFetchPending guard prevents the POST /api/session/update from firing during the live-fetch window, which is the actual data-loss vector.

Edge case to confirm: if the live fetch fails (network error, provider API down), _liveModelFetchPending should be cleared in the catch path too, not just finally. Otherwise the flag stays set and model corrections are blocked indefinitely. If the code already handles this in a finally block, great — just worth a note for reviewers.

This directly fixes the regression reported in #1169 where Kimi K2.6 sessions were silently reset to Opus 4.6 on every load. Flagging for maintainer review.

@bsgdigital

…, .env (#1179) Merged as v0.50.228. 2644 tests passing. Browser QA 21/21 (desktop 1440×900 + mobile iPhone 14). All 5 fix invariants verified live in browser. **Fix verifications:** - #1172 (`renderMd` pre-stash): `rawPreStash` present in function, `<pre>` blocks pass through without content rewrite ✅ - #1174 (model race guard): `syncTopbar()` contains `liveStillPending` guard ✅ - #1175 (tool card): `.tool-card-result pre` max-height=360px, `.tool-card.open .tool-card-detail` overflow=auto, cap=600px ✅ - #1176 (empty session guard): double-click New Conversation on empty session → stays on same session, composer focused ✅ - #1178 (`.env` atomic write): `tempfile.mkstemp + os.replace` in `providers.py`, 9/9 env tests pass ✅ Thanks @bsgdigital (#1150) and @bergeouss (#1178)!

nesquena-hermes · 2026-04-27T22:29:01Z

Merged as v0.50.228 via #1179. Thank you nesquena-hermes!

@bsgdigital

…, .env (nesquena#1179) Merged as v0.50.228. 2644 tests passing. Browser QA 21/21 (desktop 1440×900 + mobile iPhone 14). All 5 fix invariants verified live in browser. **Fix verifications:** - nesquena#1172 (`renderMd` pre-stash): `rawPreStash` present in function, `<pre>` blocks pass through without content rewrite ✅ - nesquena#1174 (model race guard): `syncTopbar()` contains `liveStillPending` guard ✅ - nesquena#1175 (tool card): `.tool-card-result pre` max-height=360px, `.tool-card.open .tool-card-detail` overflow=auto, cap=600px ✅ - nesquena#1176 (empty session guard): double-click New Conversation on empty session → stays on same session, composer focused ✅ - nesquena#1178 (`.env` atomic write): `tempfile.mkstemp + os.replace` in `providers.py`, 9/9 env tests pass ✅ Thanks @bsgdigital (nesquena#1150) and @bergeouss (nesquena#1178)!

nesquena-hermes requested a review from nesquena April 27, 2026 21:46

nesquena approved these changes Apr 27, 2026

View reviewed changes

nesquena mentioned this pull request Apr 27, 2026

fix #1169: prevent live model race condition from resetting session model #1173

Closed

nesquena-hermes mentioned this pull request Apr 27, 2026

fix: batch release v0.50.228 — renderer, model race, tool card, empty session, .env #1179

Merged

nesquena-hermes mentioned this pull request Apr 27, 2026

CRITICAL: session model silently reset to wrong model (Opus 4.6) for non-static providers like Nous+Kimi #1169

Closed

nesquena-hermes closed this Apr 27, 2026

nesquena-hermes deleted the fix/1169-session-model-race branch April 27, 2026 23:59

nesquena-hermes mentioned this pull request Apr 28, 2026

bug(models): openai-codex provider group missing from picker — no env-var detection path for OPENAI_API_KEY #1189

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(ui): defer live-model-fetch race that silently overwrites session model#1174

fix(ui): defer live-model-fetch race that silently overwrites session model#1174
nesquena-hermes wants to merge 1 commit intomasterfrom
fix/1169-session-model-race

nesquena-hermes commented Apr 27, 2026

Uh oh!

nesquena left a comment

Uh oh!

nesquena-hermes commented Apr 27, 2026

Uh oh!

nesquena-hermes commented Apr 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

nesquena-hermes commented Apr 27, 2026

Summary

Root cause

Fix

Testing

Uh oh!

nesquena left a comment

Choose a reason for hiding this comment

Review — end-to-end ✅ (clean approve)

What this ships

Traced against upstream hermes-agent

End-to-end trace

_liveModelFetchPending flag — Set keyed by provider

syncTopbar early-return — no UI flicker

Re-apply on cache+network paths

Static behaviour harness — 8/8 pass

Comparison vs PR #1173 (the parallel contributor attempt)

Edge-case trace

Tests

Minor observations (non-blocking)

Recommendation

Uh oh!

nesquena-hermes commented Apr 27, 2026

Uh oh!

nesquena-hermes commented Apr 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

`_liveModelFetchPending` flag — Set keyed by provider

`syncTopbar` early-return — no UI flicker