fix(ui): defer live-model-fetch race that silently overwrites session model#1174
fix(ui): defer live-model-fetch race that silently overwrites session model#1174nesquena-hermes wants to merge 1 commit intomasterfrom
Conversation
… model (#1169) When a user has a provider-specific model configured (e.g. Kimi K2 via Nous) that isn't in the static model list, syncTopbar() fires a setTimeout callback that finds the model absent from the dropdown and persists the wrong fallback model (first static option, e.g. Opus 4.6) to disk — before _fetchLiveModels() returns. This corrupts the session on every load. Fix has two parts: 1. Track _liveModelFetchPending per provider. While a live fetch is in flight, syncTopbar()'s 'model not found' branch skips both the sel.value override and the /api/session/update persist call entirely. 2. In _addLiveModelsToSelect(), after inserting new models, re-apply S.session.model if it is now resolvable in the updated dropdown. This restores the correct model even if syncTopbar() already ran. Closes #1169
nesquena
left a comment
There was a problem hiding this comment.
Review — end-to-end ✅ (clean approve)
What this ships
Agent's fix for #1169 — race condition where loading any session with a model only present in the live catalog (e.g. Kimi K2 via Nous) silently overwrites the session model on disk with the first static option (often Opus 4.6). 39 +/15 - in static/ui.js only.
This is the agent-prepared version; @bergeouss filed a parallel attempt as PR #1173 with a different shape. This PR is cleaner — see comparison below.
Traced against upstream hermes-agent
Pulled fresh nousresearch/hermes-agent tarball. Pure WebUI-frontend race; _liveModelFetchPending, _addLiveModelsToSelect, _applyModelToDropdown, and syncTopbar are WebUI-internal. No agent-side coupling. The session model that gets persisted via /api/session/update is the same field the agent reads on load — so any correct model write is fine, and the bug fix is about preventing incorrect writes during the race.
End-to-end trace
_liveModelFetchPending flag — Set keyed by provider
const _liveModelFetchPending=new Set();Set in _fetchLiveModels synchronously before the first await (ui.js:241) and removed in a finally block (ui.js:257-258). Because the synchronous portion runs before the JS event loop yields, by the time populateModelDropdown returns control to its caller and syncTopbar runs, the flag is already set whenever a network fetch is in flight.
Cache-hit path returns early before add() (ui.js:233-237) — by then the dropdown already has the live models synchronously, so syncTopbar will find the model and the flag is unnecessary.
syncTopbar early-return — no UI flicker
const liveStillPending=window._activeProvider&&_liveModelFetchPending.has(window._activeProvider);
if(liveStillPending){
// Live fetch in flight — don't touch sel.value or S.session.model yet.
} else {
// ... existing reset-and-persist path ...
}When live fetch is pending, the dropdown is left at its current value (which may be the default/empty option) and S.session.model is unchanged. No visible flicker — the dropdown displays the right value once _addLiveModelsToSelect runs.
This is cleaner than the alternative (set sel.value=first.value while skipping persist) because it avoids briefly showing the wrong model in the chip.
The existing if(!deferModelCorrection) literal is preserved verbatim inside the else branch — keeps the existing test_session_metadata_fast_path regression test intact without modification.
Re-apply on cache+network paths
static/ui.js:222-229 — _addLiveModelsToSelect re-applies S.session.model after adding live models, regardless of how many were added:
if(S.session && S.session.model && sel.id==='modelSelect'){
const reapplied=_applyModelToDropdown(S.session.model, sel);
if(reapplied && typeof syncModelChip==='function') syncModelChip();
}sel.id==='modelSelect' filters to the chat header's primary picker — won't re-target profile-editor pickers or other dropdowns that share the helper. ✅
Static behaviour harness — 8/8 pass
Set declaration: PASS
add() before fetch(): PASS
delete() in finally: PASS
Cache-hit return before add(): PASS
syncTopbar checks .has(): PASS
liveStillPending early return: PASS
reapply S.session.model: PASS
reapply guarded by sel.id===modelSelect: PASS
Comparison vs PR #1173 (the parallel contributor attempt)
| Aspect | #1173 (bergeouss) | #1174 (this PR) |
|---|---|---|
| Flag type | object dict {} |
new Set() — slightly more idiomatic |
| Re-apply location | new _reapplySessionModelIfFound() helper |
inlined into _addLiveModelsToSelect (DRY — works for cache and network paths) |
| Re-apply scope | S.session && S.session.model |
same + sel.id==='modelSelect' guard (avoids re-targeting other dropdowns) |
syncTopbar while pending |
sets sel.value=first.value but skips persist (visible flicker) |
early-return entirely (no flicker) |
| Persist-correction round-trip | re-posts session/update from helper | not needed — flag prevents the wrong write in the first place |
| Existing test | breaks if(!deferModelCorrection) literal assertion |
preserves it |
Net: this PR is the cleaner fix. Recommend closing #1173 in favour of this one once merged.
Edge-case trace
| Scenario | Expected | Actual |
|---|---|---|
| Session model in static list (e.g. Opus 4.6) | found immediately, no race | ✅ applied=true skips correction branch |
| Session model only in live catalog (Kimi K2) | dropdown waits for fetch, then re-applies | ✅ liveStillPending early return + re-apply |
| Live fetch fails (network error) | fallback to first static, persist | ✅ finally clears flag → next syncTopbar runs reset |
| Cache hit (cached models include the session model) | model found synchronously | ✅ cache-hit returns before add() |
| Cache hit but cached models don't include session model | reset to first + persist (correct behaviour, model truly gone) | ✅ flag never set, normal path |
| Multiple providers, one fetch pending | check uses window._activeProvider only |
✅ scoped per active provider |
sel.id !== 'modelSelect' (e.g. profile editor) |
re-apply skipped | ✅ guarded |
Tests
- All 5
test_session_metadata_fast_pathtests: pass. - Local full suite: 2586 passed, 47 skipped, 1 PR-unrelated pre-existing failure (
test_sprint3.py::test_workspace_add_rejects_system_paths— macOS-only, fails on master too). - CI on PR: ✅ test (3.11), ✅ test (3.12), ✅ test (3.13).
- Static harness: 8/8 pass.
Minor observations (non-blocking)
- The PR touches
ui.jsonly and doesn't add a regression test for the race itself (just relies on the existing tests not breaking). A behavioural test for the race would be useful but harder — it requires a real DOM and async fetch mock. Acceptable for a focused two-helper fix. _addLiveModelsToSelect's re-apply loops throughS.session.modeleven when nothing was added. Cheap (single dropdown lookup), but aif (added > 0)gate would skip the work in the no-models-arrived-yet case. Cosmetic.
Recommendation
Approved. Tight 39-line fix that closes #1169 cleanly. Set + early-return + DRY re-apply are the right primitives. No UI flicker, no incorrect persist, no cross-tool coupling, existing test preserved verbatim. Static harness verifies all 8 invariants. Parked at approval — ready for the release agent's merge/tag pipeline.
|
This is the right architectural fix for the live-model race condition. The root cause is well-identified: What the fix gets right:
Edge case to confirm: if the live fetch fails (network error, provider API down), This directly fixes the regression reported in #1169 where Kimi K2.6 sessions were silently reset to Opus 4.6 on every load. Flagging for maintainer review. |
…, .env (#1179) Merged as v0.50.228. 2644 tests passing. Browser QA 21/21 (desktop 1440×900 + mobile iPhone 14). All 5 fix invariants verified live in browser. **Fix verifications:** - #1172 (`renderMd` pre-stash): `rawPreStash` present in function, `<pre>` blocks pass through without content rewrite ✅ - #1174 (model race guard): `syncTopbar()` contains `liveStillPending` guard ✅ - #1175 (tool card): `.tool-card-result pre` max-height=360px, `.tool-card.open .tool-card-detail` overflow=auto, cap=600px ✅ - #1176 (empty session guard): double-click New Conversation on empty session → stays on same session, composer focused ✅ - #1178 (`.env` atomic write): `tempfile.mkstemp + os.replace` in `providers.py`, 9/9 env tests pass ✅ Thanks @bsgdigital (#1150) and @bergeouss (#1178)!
|
Merged as v0.50.228 via #1179. Thank you nesquena-hermes! |
…, .env (nesquena#1179) Merged as v0.50.228. 2644 tests passing. Browser QA 21/21 (desktop 1440×900 + mobile iPhone 14). All 5 fix invariants verified live in browser. **Fix verifications:** - nesquena#1172 (`renderMd` pre-stash): `rawPreStash` present in function, `<pre>` blocks pass through without content rewrite ✅ - nesquena#1174 (model race guard): `syncTopbar()` contains `liveStillPending` guard ✅ - nesquena#1175 (tool card): `.tool-card-result pre` max-height=360px, `.tool-card.open .tool-card-detail` overflow=auto, cap=600px ✅ - nesquena#1176 (empty session guard): double-click New Conversation on empty session → stays on same session, composer focused ✅ - nesquena#1178 (`.env` atomic write): `tempfile.mkstemp + os.replace` in `providers.py`, 9/9 env tests pass ✅ Thanks @bsgdigital (nesquena#1150) and @bergeouss (nesquena#1178)!
Summary
Fixes a race condition where loading any session with a Nous-routed (or other live-only) model silently overwrites the session's model on disk with the first static model in the dropdown — typically
@nous:anthropic/claude-opus-4.6. Reported with Kimi K2.6 (moonshotai/kimi-k2) but affects any model not present in the static 4-item list.Root cause
The bug is a timing race between two async operations:
_resolveSessionModelForDisplaySoon()schedules asetTimeout(..., 0)that clears_modelResolutionDeferredand callssyncTopbar().syncTopbar()calls_applyModelToDropdown(sessionModel, sel). If the model is absent from the dropdown and_modelResolutionDeferredis alreadyfalse, it immediately setssel.value = first.value(e.g. Opus 4.6) and persists that wrong model viaPOST /api/session/update._fetchLiveModels()runs in parallel — it takes 100–500ms to fetch the live model list from the provider API. By the time it completes and adds Kimi to the dropdown, the session model on disk has already been corrupted.The existing
_modelResolutionDeferredflag was designed to protect against this, but it only defers untilGET /api/session?resolve_model=1returns — it doesn't account for the live model fetch still being in flight.Fix
Two changes:
_liveModelFetchPendingset — tracks which providers have a live fetch in flight._fetchLiveModels()adds the provider to the set before the fetch and removes it in afinallyblock. WhensyncTopbar()detects a model is missing, it checks this set: if the active provider's fetch is pending, the fallback logic is skipped entirely (nosel.valueoverride, no disk write).Re-apply in
_addLiveModelsToSelect()— after inserting newly fetched models, re-applyS.session.modelif it is now resolvable. This ensures the correct model is applied even ifsyncTopbar()already ran and briefly showed the wrong model in the chip.Testing
model.default: moonshotai/kimi-k2→ load any session → model chip shows Kimi throughout; no incorrect model persisted.Closes #1169