P0 hotfix: metadata-only Session.save() wipes conversation history (#1558) #1559
2 commits merged into master
Conversation
v0.50.279 introduced `api.routes._clear_stale_stream_state()` (#1525), which calls `session.save()` to clear stale `active_stream_id`/`pending_*` fields. The helper is called from `/api/session` and `/api/session/status` — both of which load the session with `metadata_only=True`. `Session.load_metadata_only()` synthesizes a stub with `messages=[]` (its whole purpose: a fast metadata read without parsing the 400KB+ messages array). `Session.save()` unconditionally writes `self.messages` to disk via `os.replace()`, so saving a metadata-only stub atomically overwrites the on-disk JSON with `messages=[]`, wiping the entire conversation.

Production trigger: every SSE reconnect cycle after a server restart polls `/api/session/status`, which fans out to `_clear_stale_stream_state`, which saves the metadata-only stub. The user reported losing 1000+ message conversations and seeing 'Reconnecting…' loops on every prompt — the reconnect loop kept the cycle running until the conversation was empty.

Fix: three layers, defense in depth. (1) api/models.py: `load_metadata_only()` now sets `_loaded_metadata_only=True` on the returned stub. `Session.save()` raises `RuntimeError` if that flag is set — a hard guard so any future caller making the same mistake cannot wipe data, only crash visibly. (2) api/routes.py: `_clear_stale_stream_state()` now detects the metadata-only flag and re-loads the full session with `metadata_only=False` before mutating persisted state. The full-load path also runs `_repair_stale_pending()`, which independently clears the stream flags, so the explicit clear becomes a no-op in most cases — but messages stay intact. (3) api/models.py + api/session_recovery.py: every `save()` that would SHRINK the messages array (the precise failure shape of #1557) first snapshots the previous file to `<sid>.json.bak`. server.py runs `recover_all_sessions_on_startup()` at boot — any session whose live JSON has fewer messages than its .bak is restored automatically. Idempotent on clean state. Backup overhead is zero on the normal grow-the-conversation path.

Reproducer (master): `test_metadata_only_save_does_not_wipe_messages` goes from 1000 messages to 0 in a single `save()` call. After the fix, 1000 messages survive. Tests: 6 new regression tests in tests/test_metadata_save_wipe_1557.py covering all three layers. Full pytest: 4019 → 4025 (+6, all green).

Live verified on port 8789: write a 1000-msg session with a stale `active_stream_id`, hit `/api/session/status`, `/api/session` — the file ends with 1002 messages (`_repair_stale_pending` injects an error-marker pair on full reload, harmless existing behavior), `active_stream_id` cleared, pending cleared, no Reconnecting loop.

Closes #1557. Reported by AvidFuturist via user feedback on v0.50.282.
nesquena left a comment
Review — end-to-end ✅ (clean approve, with one issue-number inconsistency to fix before merge)
P0 self-built hotfix for issue #1558 (data loss on SSE reconnect after server restart). Three defensive layers + a startup self-heal recovery path. Behavioral harness confirms the bug exists on master and the guard prevents it.
What this ships
- Layer 1 — api/models.py:368-385 — `Session.save()` raises `RuntimeError` if `_loaded_metadata_only=True`. Loud crash beats silent wipe.
- Layer 1 wiring — api/models.py:494-500 — `Session.load_metadata_only()` sets `_loaded_metadata_only=True` on the returned stub.
- Layer 2 — api/routes.py:344-368 — `_clear_stale_stream_state()` detects the metadata-only flag and reloads with `metadata_only=False` before mutating. If the reload fails, it bails without clearing rather than wipe (the right asymmetry).
- Layer 3 (backup) — api/models.py:411-441 — `Session.save()` writes `<sid>.json.bak` iff the previous on-disk message count is greater than the incoming one. Asymmetric: zero overhead on the normal grow path, snapshot on every shrink.
- Layer 3 (recovery) — new api/session_recovery.py module with `recover_all_sessions_on_startup()`, `recover_session()`, and `inspect_session_recovery_status()` helpers. Stages restores via tmp + atomic replace.
- Startup hook — server.py:113-122 calls `recover_all_sessions_on_startup(SESSION_DIR)` after `fix_credential_permissions()`. Best-effort, never blocks startup.
- tests/test_metadata_save_wipe_1557.py — 6 tests covering all three layers. Mutation-verified to fail on master with `assert 0 == 1000`.
Traced against upstream hermes-agent
Webui-only fix. The metadata_only partial-load path is webui's own performance optimization for sidebar polling — the agent CLI doesn't use it. The Session.save() atomicity (via os.replace()) is webui-only too. No cross-tool surface. ✅
End-to-end trace
Pre-fix production flow (the bug):
- User on v0.50.279+ has a 1000-message session with a stale `active_stream_id` after a server restart.
- Browser SSE reconnect loop polls `/api/session/status?session_id=X` (api/routes.py:1837).
- Handler calls `get_session(sid, metadata_only=True)` → `Session.load_metadata_only(sid)` returns a stub with `messages=[]`.
- Handler calls `_clear_stale_stream_state(stub)` (api/routes.py:327).
- Helper checks `STREAMS` (server crashed → empty), so `stream_alive=False`. It mutates `stub.active_stream_id=None`, `pending_user_message=None`, and calls `stub.save()`. `Session.save()` writes `self.messages=[]` to a tmp file, then `os.replace(tmp, path)` — the on-disk JSON is now empty.
- Next reconnect cycle reads back the empty file. UI shows the loading banner. User sees their conversation shrink to 0 messages in real time.
Post-fix flow (Layer 1 + Layer 2):
1-3. Same as above.
4. _clear_stale_stream_state(stub) checks getattr(session, "_loaded_metadata_only", False) → True.
5. Helper calls get_session(stub.session_id, metadata_only=False) to reload with the full messages array.
6. The full-load path runs _repair_stale_pending() which independently clears the stale stream flags. After reload, active_stream_id=None already → helper returns False without mutating further.
7. If the explicit clear path runs (some flags still set after _repair_stale_pending), it now operates on a Session with the real messages — save() writes them back unchanged.
8. On-disk file preserved. ✅
Layer 3 (defense in depth): Even if Layer 1+2 missed a future variant of this bug shape, Layer 3's .bak snapshot before any shrinking save means the next server start auto-restores from .bak.
Behavioral harness — bug + guard verified
On disk: 1000 messages
Metadata-only stub: messages=0, _loaded_metadata_only=True
GUARD FIRED: Refusing to save metadata-only session 's_repro': would atomically
overwrite on-disk messages with []. Reload with metadata_only=False
before mutating state. See #1557.
On disk after save attempt: 1000 messages
OK: messages preserved
Confirms:
- The `_loaded_metadata_only` flag is correctly set by `load_metadata_only()`.
- The guard fires with an informative error message before any disk write.
- On-disk messages are preserved after the failed save attempt.
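For orientation, a minimal sketch of the kind of standalone harness that produces a transcript like the one above. The module path, `Session` API, and session file location are assumptions, not the repo's actual layout:

```python
# Illustrative harness only; import path, Session API, and file layout are assumed.
import json
from pathlib import Path

from api.models import Session  # assumed import path

session_file = Path("sessions/s_repro.json")  # assumed location of the seeded 1000-msg session
print(f"On disk: {len(json.loads(session_file.read_text())['messages'])} messages")

stub = Session.load_metadata_only("s_repro")
print(f"Metadata-only stub: messages={len(stub.messages)}, "
      f"_loaded_metadata_only={getattr(stub, '_loaded_metadata_only', False)}")

try:
    stub.save()  # post-fix: Layer 1 refuses before any disk write
except RuntimeError as exc:
    print(f"GUARD FIRED: {exc}")

after = len(json.loads(session_file.read_text())["messages"])
print(f"On disk after save attempt: {after} messages")
print("OK: messages preserved" if after == 1000 else "FAIL: messages lost")
```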
Security audit
- ✅ No XSS, no SSRF, no auth changes — pure data-handling fix.
- ✅ Backup files (`<sid>.json.bak`) inherit the same dir + ownership as the session JSON. `tmp_path.replace(session_path)` is atomic. Recovery uses `shutil.copyfile` + `tmp.replace()` for safety.
- ✅ `recover_all_sessions_on_startup` is best-effort with a broad `except Exception` swallow at the startup level — never blocks server startup.
- ✅ The `.bak` mechanism stores a JSON copy in the same dir; no new directory or auth surface.
- ✅ No regex / no SQL / no path traversal — all paths derived from `session_path.with_suffix()`.
Race / state analysis
- `Session.save()` runs under the per-session lock (api/models.py:393 — verified by reading existing code). The new backup-before-write step is under that same lock, so the read-existing + write-bak + write-new dance is atomic per session.
- Two threads racing on `_clear_stale_stream_state` for the same session: both would individually re-load the full session via `get_session(metadata_only=False)`, and both would call `_repair_stale_pending`, which is idempotent (already-cleared flags → no-op). Safe.
- Concurrent backup + restore: the recovery path reads `<sid>.json` and `<sid>.json.bak`, then writes `<sid>.json` via tmp + atomic replace. No interleaving with a concurrent `save()` because both go through the same per-session lock.
- Startup recovery vs concurrent save: startup recovery runs before the HTTP server starts accepting requests (server.py:113 is before the bind), so no concurrent saves can race with it.
Edge-case trace
| Scenario | Pre-fix | Post-fix |
|---|---|---|
| Metadata-only stub + save() | wipes 1000-msg conversation to 0 ❌ | RuntimeError raised, file preserved ✅ |
| `/api/session/status` SSE reconnect after restart | wipes conversation per cycle | reload-and-noop ✅ |
| `/api/session?metadata_only=true` GET | wipes if state cleanup fires | safe ✅ |
| Full-load session normal save | unaffected | unaffected (flag never set) ✅ |
| Save grows messages from 1000 to 1001 | normal | normal — no .bak written (zero overhead) ✅ |
| Save shrinks messages from 1000 to 500 | data loss | .bak written with 1000 ✅ |
| Server restart with shrunken sessions on disk | corrupted state persists | startup self-heal restores from .bak ✅ |
| Clean install, no .bak files | n/a | startup recovery is no-op ✅ |
| Metadata-only stub but full reload fails (file gone) | wipes | bails without clearing — leaves stale flag (correct asymmetry) ✅ |
| `_repair_stale_pending` already cleared the flag during full reload | n/a (no reload) | helper returns False (no-op), avoids double-clear ✅ |
| Cross-tool: agent CLI reads same session JSON | unaffected (agent doesn't use metadata_only) | unaffected ✅ |
Tests
`tests/test_metadata_save_wipe_1557.py` — 6/6 pass:
- `test_metadata_only_save_raises_to_prevent_wipe` — Layer 1 fires before disk write
- `test_clear_stale_stream_state_preserves_messages` — end-to-end Layer 2
- `test_save_writes_bak_when_messages_shrink` — Layer 3 backup fires on shrink
- `test_save_does_not_write_bak_when_messages_grow` — zero overhead on grow
- `test_recover_all_sessions_on_startup_restores_shrunken_session` — startup self-heal
- `test_recover_all_sessions_on_startup_is_idempotent_no_op_on_clean_state` — clean install no-op
- Full suite: 3972 passed, 55 skipped, 3 xpassed, 0 failed in 79.94s on the PR branch.
- CI: all 3 (3.11/3.12/3.13) green.
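For concreteness, a minimal sketch of what the shrink-backup regression test could look like. The `Session` loader name and the on-disk JSON shape here are assumptions, not the repo's actual fixtures:

```python
import json
from pathlib import Path

from api.models import Session  # assumed import path


def test_save_writes_bak_when_messages_shrink(tmp_path: Path) -> None:
    # Seed an on-disk session with 1000 messages (JSON shape assumed for the sketch).
    session_file = tmp_path / "s1.json"
    session_file.write_text(json.dumps(
        {"messages": [{"role": "user", "content": str(i)} for i in range(1000)]}
    ))

    # Load the full session (loader name assumed), shrink its messages, and save.
    session = Session.load(session_file)
    session.messages = session.messages[:500]
    session.save()

    # A shrinking save must leave a .bak snapshot of the previous 1000 messages.
    bak = tmp_path / "s1.json.bak"
    assert bak.exists()
    assert len(json.loads(bak.read_text())["messages"]) == 1000
```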
Other audit — things that are correct already
- ✅ The "bail rather than wipe" asymmetry in Layer 2 is exactly right — better to leave a stale `active_stream_id` than wipe the conversation.
- ✅ The `_repair_stale_pending` interaction is handled correctly — Layer 2 returns False if the full-load already cleared the flag.
- ✅ Reporter attribution clean (user via AvidFuturist, relayed from Twitter/X DM).
- ✅ Pre-merge checklist in PR body is unusually thorough and the reproducer test fails on master with a clear assertion message.
Minor observations (non-blocking, but worth fixing before merge)
- 🚨 Issue-number mismatch — code/tests/comments reference #1557 but the actual issue is #1558. PR title and PR body are correct ("Closes #1558"), but every code comment, the test file name (test_metadata_save_wipe_1557.py), the error-message string in `Session.save()` ("See #1557."), the docstring in api/session_recovery.py, and the commit message body ("Closes #1557.") all reference #1557. PR #1557 is an unrelated open PR by @dutchaiagency. The 17 wrong references won't auto-close issue #1558 from the commit message — and the test file name will be misleading for future archeology. Worth a one-pass `s/#1557/#1558/g` (and a rename of the test file → test_metadata_save_wipe_1558.py) before merge so the artifacts agree with reality.
- `existing_msg_count = -1` on JSON decode error (api/models.py:425) means a corrupt previous file always triggers the backup write. Reasonable defensive choice — it preserves a corrupt-but-might-be-recoverable file. Worth a comment that this is the intent (rather than a bug).
- `recover_all_sessions_on_startup` scans every .json in the dir with no further filtering — _index.json and any other non-session JSON would be scanned too. They have no `messages` key, so `_msg_count` returns -1 and `recommend="no_action"` (since no .bak exists). Harmless waste; flagging for awareness.
- Layer 3 doesn't garbage-collect old `.bak` files. A long-lived session that occasionally shrinks messages will accumulate `.bak` files (one per shrinking save, overwriting the previous). That's actually fine — only one `.bak` per session id at any time. But there's no cleanup once the session is deleted. Minor disk leak; the session file itself is what user state cares about.
- PR description test count says 4019 → 4025. Locally I see 3972 passed. Counting drift consistent with prior batches.
Recommendation
✅ Approved. Excellent P0 hotfix with three independent defensive layers, a startup self-heal recovery path, and a behavioral-harness-verified reproducer. Cross-tool safe. The "bail rather than wipe" asymmetry in Layer 2 and the "shrink-only backup" asymmetry in Layer 3 are both the right calls.
The one thing to fix before merge: the #1557 references throughout new code, comments, the test file name, and commit message body should be #1558 to match the actual issue this closes. Doc-only inconsistency, but it'll mislead future archeology and the commit-body "Closes #1557" won't auto-close issue #1558 either. A one-pass find/replace (and test-file rename) clears it cleanly.
Parked at approval — ready for the release agent's merge/tag pipeline. Given P0 severity, recommend the release agent fix the issue-number references in the same merge so the hotfix lands with consistent artifacts.
The PR title and body correctly say "Closes #1558", but every code comment, the test file name, error-message strings, docstrings, and the original commit body referenced #1557 instead. Independent reviewer flagged this:

> The 17 wrong references won't auto-close issue #1558 from the commit
> message — and the test file name will be misleading for future archeology.
> Worth a one-pass s/#1557/#1558/g (and rename test file →
> test_metadata_save_wipe_1558.py) before merge so the artifacts agree
> with reality.

This commit:
- Renames tests/test_metadata_save_wipe_1557.py → test_metadata_save_wipe_1558.py
- Replaces 17 #1557 references with #1558 across:
  - tests/test_metadata_save_wipe_1558.py (7 refs)
  - api/models.py (5 refs in Session.save guard + backup safeguard comments)
  - api/routes.py (2 refs in _clear_stale_stream_state docstring + log)
  - api/session_recovery.py (3 refs)
  - server.py (3 refs in startup self-heal block)

Verified: 6/6 tests in tests/test_metadata_save_wipe_1558.py pass with the renamed file + updated references.
Both flagged by the pre-release Opus advisor; both clearly defensive and small enough to absorb in-release per the reviewer-flagged-fix-in-release-not-followup policy.

SHOULD-FIX #1 (api/routes.py:_clear_stale_stream_state, ~25 LOC): After the metadata-only reload (nesquena#1559 Layer 2), the local 'session' variable is reassigned to the full-load object, but the caller still holds the original metadata-only stub. /api/session then returns the stale active_stream_id at routes.py:1791, causing the frontend to attempt one ghost SSE reconnect before recovering. Fix: capture original_stub at function entry, then patch its in-memory active_stream_id and pending_* fields to None after both the early-return (full-load already cleared) path AND the successful-mutation path. Now the caller's read returns fresh state, no ghost reconnect.

SHOULD-FIX #2 (api/models.py:Session.save, ~20 LOC): The .bak write at api/models.py:436 used write_text(), which truncates-then-writes — a crash mid-write or a concurrent backup-producing save could leave a torn .bak. Recovery defends correctly (JSONDecodeError → returns -1 → 'no_action'), so the failure mode was 'backup lost', not 'spurious restore'. Fix: tmp + os.replace pattern matching the main file write at lines 446-453. Now the backup either lands cleanly or doesn't land at all.

4026/4026 tests pass post-absorb.
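A minimal sketch of the tmp + os.replace pattern SHOULD-FIX #2 describes; the path layout and helper name are assumptions, not the actual diff:

```python
import os
from pathlib import Path


def write_backup_atomically(session_path: Path) -> None:
    """Snapshot session_path to <sid>.json.bak without ever exposing a torn file.

    The backup is staged to a temporary file first and then published with
    os.replace(), so a crash mid-write leaves either the old .bak or no .bak,
    never a truncated one.
    """
    bak_path = session_path.with_suffix(".json.bak")
    tmp_path = session_path.with_suffix(".json.bak.tmp")
    tmp_path.write_bytes(session_path.read_bytes())  # full copy of the current file
    os.replace(tmp_path, bak_path)                   # atomic rename, same filesystem
```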
P0 hotfix: metadata-only Session.save() wipes conversation history
Closes #1558.
TL;DR
A user on v0.50.282 reported that prompts vanish, "Reconnecting…" flashes, and a 1000+ message session disappeared. It really did. v0.50.279's `_clear_stale_stream_state()` (#1525) calls `save()` on a metadata-only-loaded Session, which atomically overwrites the on-disk JSON with `messages=[]`. Every active conversation on v0.50.279 – v0.50.282 is at risk on the next SSE reconnect after a server restart.

This PR ships three defensive layers and a user-recovery path. Recommend shipping as v0.50.284 immediately.
What goes wrong
api/routes.py:233 (introduced in #1525) is the culprit. It is called from `/api/session` (line 1695) and `/api/session/status` (line 1837), both of which load with `metadata_only=True`. `Session.load_metadata_only()` synthesizes a stub with `messages=[]` (it's a fast-path that skips parsing the messages array). `Session.save()` writes `self.messages` to disk via `os.replace()`. Result: the on-disk JSON gets atomically replaced with an empty messages list. A sketch of the pre-fix helper follows.
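A minimal reconstruction of the pre-fix helper as described above and in the review's trace; names like `STREAMS` and `pending_user_message` come from the review, and everything else (signature, structure) is assumed:

```python
# Illustrative reconstruction of the pre-fix helper, not the actual diff.
STREAMS: dict[str, object] = {}  # in-memory registry of live SSE streams (per the review)


def _clear_stale_stream_state(session) -> bool:
    """Clear active_stream_id/pending_* when the recorded stream is no longer alive."""
    stream_id = getattr(session, "active_stream_id", None)
    if not stream_id:
        return False

    # After a server restart the registry is empty, so any recorded id is dead.
    if stream_id in STREAMS:
        return False

    session.active_stream_id = None
    session.pending_user_message = None
    # BUG: when `session` was loaded with metadata_only=True, its messages list is
    # the synthesized [] stub, so this save() atomically replaces the on-disk JSON
    # with an empty conversation (the #1558 data loss).
    session.save()
    return True
```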
The user's "Reconnecting…" loop is the SSE reconnect handler hitting `/api/session/status` repeatedly — every cycle was firing the wipe and then re-firing.

Fix — three layers
Layer 1: Hard guard in `Session.save()`

`load_metadata_only()` now sets `_loaded_metadata_only=True` on the returned stub. `save()` raises `RuntimeError` on that flag (sketched below). A future caller making the same mistake hits a loud crash instead of silently wiping data. Tests pin both the flag and the exception.
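A minimal sketch of the flag wiring and the guard under the assumptions in this description; the attribute layout and error text are illustrative, not the repo's code:

```python
class Session:
    """Sketch of the relevant slice of api.models.Session (attribute names assumed)."""

    def __init__(self, session_id: str, messages: list | None = None):
        self.session_id = session_id
        self.messages = messages or []
        self._loaded_metadata_only = False

    @classmethod
    def load_metadata_only(cls, session_id: str) -> "Session":
        stub = cls(session_id)             # fast path: metadata read, messages not parsed
        stub._loaded_metadata_only = True  # Layer 1 wiring: mark the stub
        return stub

    def save(self) -> None:
        # Layer 1: a metadata-only stub must never be persisted. Its empty
        # messages list would atomically overwrite the real conversation.
        if self._loaded_metadata_only:
            raise RuntimeError(
                f"Refusing to save metadata-only session {self.session_id!r}: "
                "would overwrite on-disk messages with []. Reload with "
                "metadata_only=False before mutating state."
            )
        # ... real implementation: write messages to a tmp file, then os.replace() ...
```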
Layer 2: Fix `_clear_stale_stream_state()` itself

When handed a metadata-only stub, reload the session with `metadata_only=False` before mutating (a sketch follows).
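A minimal sketch of the reload-before-mutate shape, assuming the `get_session()` helper referenced in the review's trace; the real helper's structure will differ:

```python
def _clear_stale_stream_state(session) -> bool:
    """Clear stale stream flags, but never operate on a metadata-only stub."""
    if getattr(session, "_loaded_metadata_only", False):
        # Layer 2: re-load the full session (messages included) before any save().
        full = get_session(session.session_id, metadata_only=False)
        if full is None:
            # Reload failed (file gone, parse error, ...): bail without clearing.
            # Leaving a stale active_stream_id is better than wiping messages.
            return False
        session = full

    if not getattr(session, "active_stream_id", None):
        # _repair_stale_pending() on the full-load path may already have cleared
        # the flags, so this is frequently a no-op.
        return False

    session.active_stream_id = None
    session.pending_user_message = None
    session.save()  # now persists the real messages array, unchanged
    return True
```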
The full-load path runs `_repair_stale_pending()`, which independently clears stale stream flags, so the explicit clear becomes a near-no-op in steady state — but messages stay intact.

Layer 3: Backup-before-update + startup self-heal
Per the user's request, this layer adds defense against future regressions of a similar shape.

In `Session.save()`, before the atomic write, the previous on-disk message count is compared against the incoming one (sketched below). The asymmetric guard means zero overhead on the normal grow path and a `.bak` snapshot on every shrinking save.
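A minimal sketch of the shrink-only backup check, with the file layout and helper name assumed; the real code also runs under the per-session lock noted in the review:

```python
import json
import os
from pathlib import Path


def save_with_shrink_backup(session_path: Path, messages: list[dict]) -> None:
    """Persist messages, snapshotting the previous file first if it would shrink."""
    existing_msg_count = -1
    if session_path.exists():
        try:
            existing_msg_count = len(json.loads(session_path.read_text()).get("messages", []))
        except (json.JSONDecodeError, OSError):
            existing_msg_count = -1  # unreadable previous file: back it up to be safe

    if session_path.exists() and (existing_msg_count == -1 or existing_msg_count > len(messages)):
        # Shrinking (or unreadable) save: snapshot what is on disk right now.
        session_path.with_suffix(".json.bak").write_bytes(session_path.read_bytes())

    # Normal atomic write of the new state: tmp file, then os.replace().
    tmp = session_path.with_suffix(".json.tmp")
    tmp.write_text(json.dumps({"messages": messages}))
    os.replace(tmp, session_path)
```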
A new api/session_recovery.py module handles startup recovery. server.py calls `recover_all_sessions_on_startup(SESSION_DIR)` after `fix_credential_permissions()` (a sketch follows). Idempotent on clean state. The first server start after deploying #1558 will auto-restore any session that was wiped between deploys.
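A minimal sketch of the startup self-heal, assuming the `.bak` layout above; per the review, the real module also exposes `recover_session()` and `inspect_session_recovery_status()`:

```python
import json
import logging
from pathlib import Path

log = logging.getLogger(__name__)


def _msg_count(path: Path) -> int:
    """Message count of a session JSON, or -1 if missing/unreadable."""
    try:
        return len(json.loads(path.read_text()).get("messages", []))
    except (OSError, json.JSONDecodeError):
        return -1


def recover_all_sessions_on_startup(session_dir: Path) -> int:
    """Restore any session whose live JSON has fewer messages than its .bak."""
    restored = 0
    for bak in session_dir.glob("*.json.bak"):
        live = bak.with_name(bak.name[: -len(".bak")])
        if _msg_count(bak) > _msg_count(live):
            # Stage the restore through a tmp file, then publish atomically.
            tmp = live.with_suffix(".json.recovering")
            tmp.write_bytes(bak.read_bytes())
            tmp.replace(live)
            restored += 1
            log.warning("Restored %s from backup (live copy had fewer messages)", live.name)
    return restored
```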
Tests — 6 new in `tests/test_metadata_save_wipe_1557.py`

- `test_metadata_only_save_raises_to_prevent_wipe`
- `test_clear_stale_stream_state_preserves_messages`
- `test_save_writes_bak_when_messages_shrink`
- `test_save_does_not_write_bak_when_messages_grow`
- `test_recover_all_sessions_on_startup_restores_shrunken_session`
- `test_recover_all_sessions_on_startup_is_idempotent_no_op_on_clean_state`

Full pytest: 4019 → 4025 (+6 new), all green.
Repro on master before the fix:
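The original snippet isn't preserved here; a minimal reconstruction of the reproducer described above, with the loader signature and JSON shape assumed:

```python
import json
from pathlib import Path

from api.models import Session  # assumed import path


def test_metadata_only_save_does_not_wipe_messages(tmp_path: Path) -> None:
    # Seed 1000 messages on disk (exact JSON shape assumed for the sketch).
    session_file = tmp_path / "s_repro.json"
    session_file.write_text(json.dumps(
        {"messages": [{"role": "user", "content": str(i)} for i in range(1000)]}
    ))

    # Load the fast-path stub and save it; on master this rewrites the file with messages=[].
    stub = Session.load_metadata_only("s_repro")  # signature assumed
    try:
        stub.save()                               # post-fix this raises RuntimeError (Layer 1)
    except RuntimeError:
        pass

    remaining = len(json.loads(session_file.read_text())["messages"])
    assert remaining == 1000  # on master this fails with `assert 0 == 1000`
```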
Live e2e verification on port 8789
(The +2 messages over the 1000 starting state is `_repair_stale_pending` injecting a stale-pending error-marker pair on full reload — pre-existing behavior, harmless, locked by the test that asserts >= 1000.)

The on-disk file ends with 1002 messages, `active_stream_id` cleared, pending state cleared, and no .bak written.
Backup correctly skipped because messages grew, not shrank.
Affected versions
Every active conversation on v0.50.279 – v0.50.282 is at risk; v0.50.284 carries the fix. The release notes for v0.50.284 should call this out and urge a server restart on first launch (kicks off the startup self-heal).
Why three layers
It's tempting to ship just Layer 2 (fix the one helper) and call it a day. The reason all three matter:
- Layer 1 alone turns any future metadata-only save into a loud crash instead of a silent wipe.
- Layer 2 alone fixes the one caller we know about, but says nothing about the next one.
- Layer 3 alone cannot prevent a wipe, but any future shrink-shaped save leaves a `.bak` behind for the startup self-heal to restore.

Reality check on Layer 3 for users on v0.50.282: they've been running the buggy version for ~12 hours. Whether they have `.bak` files depends on whether their save() shrunk messages — which it did, every single time. So `recover_all_sessions_on_startup()` will likely fire for them on the very first launch of v0.50.284. We should put a banner / toast / log line so they actually notice.

Pre-merge checklist
- Reproducer test fails on master (`assert 0 == 1000`)
User on Twitter/X DM, relayed by AvidFuturist (May 03 2026):
The reconnect screenshot —
Reconnecting...with a counter — is the data-loss code path firing on every cycle.