fix(streaming): lock stale stream cleanup (#1533)#1557
fix(streaming): lock stale stream cleanup (#1533)#1557dutchaiagency wants to merge 1 commit intonesquena:masterfrom
Conversation
The PR title and body correctly say 'Closes #1558' but every code comment, the test file name, error-message strings, docstrings, and the original commit body referenced #1557 instead. Independent reviewer flagged this: > The 17 wrong references won't auto-close issue #1558 from the commit > message — and the test file name will be misleading for future archeology. > Worth a one-pass s/#1557/#1558/g (and rename test file → > test_metadata_save_wipe_1558.py) before merge so the artifacts agree > with reality. This commit: - Renames tests/test_metadata_save_wipe_1557.py → test_metadata_save_wipe_1558.py - Replaces 17 #1557 references with #1558 across: - tests/test_metadata_save_wipe_1558.py (7 refs) - api/models.py (5 refs in Session.save guard + backup safeguard comments) - api/routes.py (2 refs in _clear_stale_stream_state docstring + log) - api/session_recovery.py (3 refs) - server.py (3 refs in startup self-heal block) Verified: 6/6 tests in tests/test_metadata_save_wipe_1558.py pass with the renamed file + updated references.
|
Shipped in v0.50.284 — thanks @dutchaiagency, the lock-and-re-read approach plus the deterministic two-thread regression test were both exactly right. The PR shows as For context on what landed alongside #1557 in v0.50.284: a P0 data-loss hotfix (#1559, closes #1558) that shared the same Then immediately found out v0.50.284's recovery half didn't actually fire in production (caught a Thanks again — solid contribution. |
…sation history v0.50.279 introduced api.routes._clear_stale_stream_state() (nesquena#1525) which calls session.save() to clear stale active_stream_id/pending_* fields. The helper is called from /api/session and /api/session/status — both of which load the session with metadata_only=True. Session.load_metadata_only() synthesizes a stub with messages=[] (its whole purpose: fast metadata read without parsing the 400KB+ messages array). Session.save() unconditionally writes self.messages to disk via os.replace(), so saving a metadata-only stub atomically overwrites the on-disk JSON with messages=[], wiping the entire conversation. Production trigger: every SSE reconnect cycle after a server restart polls /api/session/status, which fans out to _clear_stale_stream_state, which saves the metadata-only stub. The user reported losing 1000+ message conversations and seeing 'Reconnecting…' loops on every prompt — the reconnect loop kept the cycle running until the conversation was empty. Fix: three layers, defense in depth. (1) api/models.py: load_metadata_only() now sets _loaded_metadata_only=True on the returned stub. Session.save() raises RuntimeError if that flag is set — a hard guard so any future caller making the same mistake cannot wipe data, only crash visibly. (2) api/routes.py: _clear_stale_stream_state() now detects the metadata-only flag and re-loads the full session with metadata_only=False before mutating persisted state. The full-load path also runs _repair_stale_pending() which independently clears the stream flags, so the explicit clear becomes a no-op in most cases — but messages stay intact. (3) api/models.py + api/session_recovery.py: every save() that would SHRINK the messages array (the precise failure shape of nesquena#1557) first snapshots the previous file to <sid>.json.bak. Server.py runs recover_all_sessions_on_startup() at boot — any session whose live JSON has fewer messages than its .bak is restored automatically. Idempotent on clean state. Backup overhead is zero on the normal grow-the-conversation path. Reproducer (master): test_metadata_only_save_does_not_wipe_messages goes from 1000 messages to 0 in a single save() call. After the fix, 1000 messages survive. Tests: 6 new regression tests in tests/test_metadata_save_wipe_1557.py covering all three layers. Full pytest: 4019 → 4025 (+6, all green). Live verified on port 8789: write 1000-msg session with stale active_stream_id, hit /api/session/status, /api/session — file ends with 1002 messages (_repair_stale_pending injects an error-marker pair on full reload, harmless existing behavior), active_stream_id cleared, pending cleared, no Reconnecting loop. Closes nesquena#1557. Reported by AvidFuturist via user feedback on v0.50.282.
… review feedback) The PR title and body correctly say 'Closes nesquena#1558' but every code comment, the test file name, error-message strings, docstrings, and the original commit body referenced nesquena#1557 instead. Independent reviewer flagged this: > The 17 wrong references won't auto-close issue nesquena#1558 from the commit > message — and the test file name will be misleading for future archeology. > Worth a one-pass s/nesquena#1557/nesquena#1558/g (and rename test file → > test_metadata_save_wipe_1558.py) before merge so the artifacts agree > with reality. This commit: - Renames tests/test_metadata_save_wipe_1557.py → test_metadata_save_wipe_1558.py - Replaces 17 nesquena#1557 references with nesquena#1558 across: - tests/test_metadata_save_wipe_1558.py (7 refs) - api/models.py (5 refs in Session.save guard + backup safeguard comments) - api/routes.py (2 refs in _clear_stale_stream_state docstring + log) - api/session_recovery.py (3 refs) - server.py (3 refs in startup self-heal block) Verified: 6/6 tests in tests/test_metadata_save_wipe_1558.py pass with the renamed file + updated references.
Summary
active_stream_idunder that lock and bail if a concurrent chat start already registered a new streamFixes #1533.
Practitioner note: this is the same lock-class I fixed today in our agent ops mail sender (
dutchaiagency/ai-agent-duo@ec57e9fadded the recipient-level send lock), so I used an actual race test here instead of only asserting source shape.Test plan
python -m pytest tests\test_stale_stream_cleanup.py -q-> 6 passedpython -m pytest tests\test_stale_stream_cleanup.py tests\test_stale_stream_pending_recovery.py tests\test_stale_empty_session_restore.py -q-> 10 passed