Skip to content

fix(ui): defer live-model-fetch race that silently overwrites session model#1174

Closed
nesquena-hermes wants to merge 1 commit intomasterfrom
fix/1169-session-model-race
Closed

fix(ui): defer live-model-fetch race that silently overwrites session model#1174
nesquena-hermes wants to merge 1 commit intomasterfrom
fix/1169-session-model-race

Conversation

@nesquena-hermes
Copy link
Copy Markdown
Collaborator

Summary

Fixes a race condition where loading any session with a Nous-routed (or other live-only) model silently overwrites the session's model on disk with the first static model in the dropdown — typically @nous:anthropic/claude-opus-4.6. Reported with Kimi K2.6 (moonshotai/kimi-k2) but affects any model not present in the static 4-item list.

Root cause

The bug is a timing race between two async operations:

  1. _resolveSessionModelForDisplaySoon() schedules a setTimeout(..., 0) that clears _modelResolutionDeferred and calls syncTopbar().
  2. syncTopbar() calls _applyModelToDropdown(sessionModel, sel). If the model is absent from the dropdown and _modelResolutionDeferred is already false, it immediately sets sel.value = first.value (e.g. Opus 4.6) and persists that wrong model via POST /api/session/update.
  3. _fetchLiveModels() runs in parallel — it takes 100–500ms to fetch the live model list from the provider API. By the time it completes and adds Kimi to the dropdown, the session model on disk has already been corrupted.

The existing _modelResolutionDeferred flag was designed to protect against this, but it only defers until GET /api/session?resolve_model=1 returns — it doesn't account for the live model fetch still being in flight.

Fix

Two changes:

  1. _liveModelFetchPending set — tracks which providers have a live fetch in flight. _fetchLiveModels() adds the provider to the set before the fetch and removes it in a finally block. When syncTopbar() detects a model is missing, it checks this set: if the active provider's fetch is pending, the fallback logic is skipped entirely (no sel.value override, no disk write).

  2. Re-apply in _addLiveModelsToSelect() — after inserting newly fetched models, re-apply S.session.model if it is now resolvable. This ensures the correct model is applied even if syncTopbar() already ran and briefly showed the wrong model in the chip.

Testing

  • All 2634 tests pass.
  • Manual: configure a Nous profile with model.default: moonshotai/kimi-k2 → load any session → model chip shows Kimi throughout; no incorrect model persisted.

Closes #1169

… model (#1169)

When a user has a provider-specific model configured (e.g. Kimi K2 via Nous)
that isn't in the static model list, syncTopbar() fires a setTimeout callback
that finds the model absent from the dropdown and persists the wrong fallback
model (first static option, e.g. Opus 4.6) to disk — before _fetchLiveModels()
returns. This corrupts the session on every load.

Fix has two parts:
1. Track _liveModelFetchPending per provider. While a live fetch is in flight,
   syncTopbar()'s 'model not found' branch skips both the sel.value override
   and the /api/session/update persist call entirely.
2. In _addLiveModelsToSelect(), after inserting new models, re-apply
   S.session.model if it is now resolvable in the updated dropdown. This
   restores the correct model even if syncTopbar() already ran.

Closes #1169
Copy link
Copy Markdown
Owner

@nesquena nesquena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — end-to-end ✅ (clean approve)

What this ships

Agent's fix for #1169 — race condition where loading any session with a model only present in the live catalog (e.g. Kimi K2 via Nous) silently overwrites the session model on disk with the first static option (often Opus 4.6). 39 +/15 - in static/ui.js only.

This is the agent-prepared version; @bergeouss filed a parallel attempt as PR #1173 with a different shape. This PR is cleaner — see comparison below.

Traced against upstream hermes-agent

Pulled fresh nousresearch/hermes-agent tarball. Pure WebUI-frontend race; _liveModelFetchPending, _addLiveModelsToSelect, _applyModelToDropdown, and syncTopbar are WebUI-internal. No agent-side coupling. The session model that gets persisted via /api/session/update is the same field the agent reads on load — so any correct model write is fine, and the bug fix is about preventing incorrect writes during the race.

End-to-end trace

_liveModelFetchPending flag — Set keyed by provider

static/ui.js:168-172:

const _liveModelFetchPending=new Set();

Set in _fetchLiveModels synchronously before the first await (ui.js:241) and removed in a finally block (ui.js:257-258). Because the synchronous portion runs before the JS event loop yields, by the time populateModelDropdown returns control to its caller and syncTopbar runs, the flag is already set whenever a network fetch is in flight.

Cache-hit path returns early before add() (ui.js:233-237) — by then the dropdown already has the live models synchronously, so syncTopbar will find the model and the flag is unnecessary.

syncTopbar early-return — no UI flicker

static/ui.js:1849-1875:

const liveStillPending=window._activeProvider&&_liveModelFetchPending.has(window._activeProvider);
if(liveStillPending){
  // Live fetch in flight — don't touch sel.value or S.session.model yet.
} else {
  // ... existing reset-and-persist path ...
}

When live fetch is pending, the dropdown is left at its current value (which may be the default/empty option) and S.session.model is unchanged. No visible flicker — the dropdown displays the right value once _addLiveModelsToSelect runs.

This is cleaner than the alternative (set sel.value=first.value while skipping persist) because it avoids briefly showing the wrong model in the chip.

The existing if(!deferModelCorrection) literal is preserved verbatim inside the else branch — keeps the existing test_session_metadata_fast_path regression test intact without modification.

Re-apply on cache+network paths

static/ui.js:222-229_addLiveModelsToSelect re-applies S.session.model after adding live models, regardless of how many were added:

if(S.session && S.session.model && sel.id==='modelSelect'){
  const reapplied=_applyModelToDropdown(S.session.model, sel);
  if(reapplied && typeof syncModelChip==='function') syncModelChip();
}

sel.id==='modelSelect' filters to the chat header's primary picker — won't re-target profile-editor pickers or other dropdowns that share the helper. ✅

Static behaviour harness — 8/8 pass

Set declaration: PASS
add() before fetch(): PASS
delete() in finally: PASS
Cache-hit return before add(): PASS
syncTopbar checks .has(): PASS
liveStillPending early return: PASS
reapply S.session.model: PASS
reapply guarded by sel.id===modelSelect: PASS

Comparison vs PR #1173 (the parallel contributor attempt)

Aspect #1173 (bergeouss) #1174 (this PR)
Flag type object dict {} new Set() — slightly more idiomatic
Re-apply location new _reapplySessionModelIfFound() helper inlined into _addLiveModelsToSelect (DRY — works for cache and network paths)
Re-apply scope S.session && S.session.model same + sel.id==='modelSelect' guard (avoids re-targeting other dropdowns)
syncTopbar while pending sets sel.value=first.value but skips persist (visible flicker) early-return entirely (no flicker)
Persist-correction round-trip re-posts session/update from helper not needed — flag prevents the wrong write in the first place
Existing test breaks if(!deferModelCorrection) literal assertion preserves it

Net: this PR is the cleaner fix. Recommend closing #1173 in favour of this one once merged.

Edge-case trace

Scenario Expected Actual
Session model in static list (e.g. Opus 4.6) found immediately, no race applied=true skips correction branch
Session model only in live catalog (Kimi K2) dropdown waits for fetch, then re-applies ✅ liveStillPending early return + re-apply
Live fetch fails (network error) fallback to first static, persist ✅ finally clears flag → next syncTopbar runs reset
Cache hit (cached models include the session model) model found synchronously ✅ cache-hit returns before add()
Cache hit but cached models don't include session model reset to first + persist (correct behaviour, model truly gone) ✅ flag never set, normal path
Multiple providers, one fetch pending check uses window._activeProvider only ✅ scoped per active provider
sel.id !== 'modelSelect' (e.g. profile editor) re-apply skipped ✅ guarded

Tests

  • All 5 test_session_metadata_fast_path tests: pass.
  • Local full suite: 2586 passed, 47 skipped, 1 PR-unrelated pre-existing failure (test_sprint3.py::test_workspace_add_rejects_system_paths — macOS-only, fails on master too).
  • CI on PR: ✅ test (3.11), ✅ test (3.12), ✅ test (3.13).
  • Static harness: 8/8 pass.

Minor observations (non-blocking)

  • The PR touches ui.js only and doesn't add a regression test for the race itself (just relies on the existing tests not breaking). A behavioural test for the race would be useful but harder — it requires a real DOM and async fetch mock. Acceptable for a focused two-helper fix.
  • _addLiveModelsToSelect's re-apply loops through S.session.model even when nothing was added. Cheap (single dropdown lookup), but a if (added > 0) gate would skip the work in the no-models-arrived-yet case. Cosmetic.

Recommendation

Approved. Tight 39-line fix that closes #1169 cleanly. Set + early-return + DRY re-apply are the right primitives. No UI flicker, no incorrect persist, no cross-tool coupling, existing test preserved verbatim. Static harness verifies all 8 invariants. Parked at approval — ready for the release agent's merge/tag pipeline.

@nesquena-hermes
Copy link
Copy Markdown
Collaborator Author

This is the right architectural fix for the live-model race condition. The root cause is well-identified: _modelResolutionDeferred is cleared in the same setTimeout(..., 0) callback that calls syncTopbar(), so the deferred flag provides zero protection against the live-fetch window.

What the fix gets right:

  • Deferring the model-correction persist until _fetchLiveModels() completes (or fails) ensures the model is never written to disk while the live dropdown is still being built.
  • Re-running _applyModelToDropdown() after live models are added is the clean way to give the correction logic a second chance — this is idempotent and has no effect if the model was already correctly applied.
  • The _liveModelFetchPending guard prevents the POST /api/session/update from firing during the live-fetch window, which is the actual data-loss vector.

Edge case to confirm: if the live fetch fails (network error, provider API down), _liveModelFetchPending should be cleared in the catch path too, not just finally. Otherwise the flag stays set and model corrections are blocked indefinitely. If the code already handles this in a finally block, great — just worth a note for reviewers.

This directly fixes the regression reported in #1169 where Kimi K2.6 sessions were silently reset to Opus 4.6 on every load. Flagging for maintainer review.

nesquena-hermes added a commit that referenced this pull request Apr 27, 2026
…, .env (#1179)

Merged as v0.50.228. 2644 tests passing. Browser QA 21/21 (desktop 1440×900 + mobile iPhone 14). All 5 fix invariants verified live in browser.

**Fix verifications:**
- #1172 (`renderMd` pre-stash): `rawPreStash` present in function, `<pre>` blocks pass through without content rewrite ✅
- #1174 (model race guard): `syncTopbar()` contains `liveStillPending` guard ✅
- #1175 (tool card): `.tool-card-result pre` max-height=360px, `.tool-card.open .tool-card-detail` overflow=auto, cap=600px ✅  
- #1176 (empty session guard): double-click New Conversation on empty session → stays on same session, composer focused ✅
- #1178 (`.env` atomic write): `tempfile.mkstemp + os.replace` in `providers.py`, 9/9 env tests pass ✅

Thanks @bsgdigital (#1150) and @bergeouss (#1178)!
@nesquena-hermes
Copy link
Copy Markdown
Collaborator Author

Merged as v0.50.228 via #1179. Thank you nesquena-hermes!

@nesquena-hermes nesquena-hermes deleted the fix/1169-session-model-race branch April 27, 2026 23:59
JKJameson pushed a commit to JKJameson/hermes-webui that referenced this pull request Apr 29, 2026
…, .env (nesquena#1179)

Merged as v0.50.228. 2644 tests passing. Browser QA 21/21 (desktop 1440×900 + mobile iPhone 14). All 5 fix invariants verified live in browser.

**Fix verifications:**
- nesquena#1172 (`renderMd` pre-stash): `rawPreStash` present in function, `<pre>` blocks pass through without content rewrite ✅
- nesquena#1174 (model race guard): `syncTopbar()` contains `liveStillPending` guard ✅
- nesquena#1175 (tool card): `.tool-card-result pre` max-height=360px, `.tool-card.open .tool-card-detail` overflow=auto, cap=600px ✅  
- nesquena#1176 (empty session guard): double-click New Conversation on empty session → stays on same session, composer focused ✅
- nesquena#1178 (`.env` atomic write): `tempfile.mkstemp + os.replace` in `providers.py`, 9/9 env tests pass ✅

Thanks @bsgdigital (nesquena#1150) and @bergeouss (nesquena#1178)!
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CRITICAL: session model silently reset to wrong model (Opus 4.6) for non-static providers like Nous+Kimi

2 participants