Skip to content

fix(streaming): lock stale stream cleanup (#1533)#1557

Closed
dutchaiagency wants to merge 1 commit intonesquena:masterfrom
dutchaiagency:fix-1533-stale-stream-lock-race
Closed

fix(streaming): lock stale stream cleanup (#1533)#1557
dutchaiagency wants to merge 1 commit intonesquena:masterfrom
dutchaiagency:fix-1533-stale-stream-lock-race

Conversation

@dutchaiagency
Copy link
Copy Markdown
Contributor

Summary

  • guard stale stream cleanup mutations with the existing per-session session lock
  • re-read active_stream_id under that lock and bail if a concurrent chat start already registered a new stream
  • add a deterministic two-thread regression test for the old stale-cleanup clobber window

Fixes #1533.

Practitioner note: this is the same lock-class I fixed today in our agent ops mail sender (dutchaiagency/ai-agent-duo@ec57e9f added the recipient-level send lock), so I used an actual race test here instead of only asserting source shape.

Test plan

  • python -m pytest tests\test_stale_stream_cleanup.py -q -> 6 passed
  • python -m pytest tests\test_stale_stream_cleanup.py tests\test_stale_stream_pending_recovery.py tests\test_stale_empty_session_restore.py -q -> 10 passed

nesquena-hermes pushed a commit that referenced this pull request May 3, 2026
The PR title and body correctly say 'Closes #1558' but every code comment,
the test file name, error-message strings, docstrings, and the original
commit body referenced #1557 instead. Independent reviewer flagged this:

> The 17 wrong references won't auto-close issue #1558 from the commit
> message — and the test file name will be misleading for future archeology.
> Worth a one-pass s/#1557/#1558/g (and rename test file →
> test_metadata_save_wipe_1558.py) before merge so the artifacts agree
> with reality.

This commit:
- Renames tests/test_metadata_save_wipe_1557.py → test_metadata_save_wipe_1558.py
- Replaces 17 #1557 references with #1558 across:
  - tests/test_metadata_save_wipe_1558.py (7 refs)
  - api/models.py (5 refs in Session.save guard + backup safeguard comments)
  - api/routes.py (2 refs in _clear_stale_stream_state docstring + log)
  - api/session_recovery.py (3 refs)
  - server.py (3 refs in startup self-heal block)

Verified: 6/6 tests in tests/test_metadata_save_wipe_1558.py pass
with the renamed file + updated references.
@nesquena-hermes
Copy link
Copy Markdown
Collaborator

Shipped in v0.50.284 — thanks @dutchaiagency, the lock-and-re-read approach plus the deterministic two-thread regression test were both exactly right.

The PR shows as CLOSED, merged=null rather than MERGED because of a known GitHub quirk we hit periodically (rebased-pr-close-gotcha): when a maintainer-side commit lands on the stage branch before the release merges, GitHub's auto-close fires correctly but doesn't tag the merge state. Your commits are in master with their original authorship intact — verifiable via git log --author=dutchaiagency on master.

For context on what landed alongside #1557 in v0.50.284: a P0 data-loss hotfix (#1559, closes #1558) that shared the same _clear_stale_stream_state function. Both fixes layered correctly — #1558's metadata-only check runs FIRST (don't even acquire the agent lock for a stub) and your #1557 lock-and-re-read wraps the legitimate-mutation path. Pre-release Opus advisor verified the order is correct and there's no deadlock risk.

Then immediately found out v0.50.284's recovery half didn't actually fire in production (caught a _index.json shape mismatch crashing the scanner), so v0.50.285 shipped same-day to fix that. Verified live: recovery now scans 144 sessions cleanly, zero crashes.

Tag: https://github.com/nesquena/hermes-webui/releases/tag/v0.50.284

Thanks again — solid contribution.

pull Bot pushed a commit to soitun/hermes-webui that referenced this pull request May 3, 2026
bsgdigital pushed a commit to fox-in-the-box-ai/hermes-webui that referenced this pull request May 4, 2026
…sation history

v0.50.279 introduced api.routes._clear_stale_stream_state() (nesquena#1525) which
calls session.save() to clear stale active_stream_id/pending_* fields. The
helper is called from /api/session and /api/session/status — both of which
load the session with metadata_only=True. Session.load_metadata_only()
synthesizes a stub with messages=[] (its whole purpose: fast metadata read
without parsing the 400KB+ messages array). Session.save() unconditionally
writes self.messages to disk via os.replace(), so saving a metadata-only
stub atomically overwrites the on-disk JSON with messages=[], wiping the
entire conversation.

Production trigger: every SSE reconnect cycle after a server restart polls
/api/session/status, which fans out to _clear_stale_stream_state, which
saves the metadata-only stub. The user reported losing 1000+ message
conversations and seeing 'Reconnecting…' loops on every prompt — the
reconnect loop kept the cycle running until the conversation was empty.

Fix: three layers, defense in depth.

(1) api/models.py: load_metadata_only() now sets _loaded_metadata_only=True
    on the returned stub. Session.save() raises RuntimeError if that flag
    is set — a hard guard so any future caller making the same mistake
    cannot wipe data, only crash visibly.

(2) api/routes.py: _clear_stale_stream_state() now detects the metadata-only
    flag and re-loads the full session with metadata_only=False before
    mutating persisted state. The full-load path also runs
    _repair_stale_pending() which independently clears the stream flags,
    so the explicit clear becomes a no-op in most cases — but messages
    stay intact.

(3) api/models.py + api/session_recovery.py: every save() that would
    SHRINK the messages array (the precise failure shape of nesquena#1557) first
    snapshots the previous file to <sid>.json.bak. Server.py runs
    recover_all_sessions_on_startup() at boot — any session whose live
    JSON has fewer messages than its .bak is restored automatically.
    Idempotent on clean state. Backup overhead is zero on the normal
    grow-the-conversation path.

Reproducer (master): test_metadata_only_save_does_not_wipe_messages goes
from 1000 messages to 0 in a single save() call. After the fix, 1000
messages survive.

Tests: 6 new regression tests in tests/test_metadata_save_wipe_1557.py
covering all three layers. Full pytest: 4019 → 4025 (+6, all green).

Live verified on port 8789: write 1000-msg session with stale active_stream_id,
hit /api/session/status, /api/session — file ends with 1002 messages
(_repair_stale_pending injects an error-marker pair on full reload, harmless
existing behavior), active_stream_id cleared, pending cleared, no Reconnecting
loop.

Closes nesquena#1557.

Reported by AvidFuturist via user feedback on v0.50.282.
bsgdigital pushed a commit to fox-in-the-box-ai/hermes-webui that referenced this pull request May 4, 2026
… review feedback)

The PR title and body correctly say 'Closes nesquena#1558' but every code comment,
the test file name, error-message strings, docstrings, and the original
commit body referenced nesquena#1557 instead. Independent reviewer flagged this:

> The 17 wrong references won't auto-close issue nesquena#1558 from the commit
> message — and the test file name will be misleading for future archeology.
> Worth a one-pass s/nesquena#1557/nesquena#1558/g (and rename test file →
> test_metadata_save_wipe_1558.py) before merge so the artifacts agree
> with reality.

This commit:
- Renames tests/test_metadata_save_wipe_1557.py → test_metadata_save_wipe_1558.py
- Replaces 17 nesquena#1557 references with nesquena#1558 across:
  - tests/test_metadata_save_wipe_1558.py (7 refs)
  - api/models.py (5 refs in Session.save guard + backup safeguard comments)
  - api/routes.py (2 refs in _clear_stale_stream_state docstring + log)
  - api/session_recovery.py (3 refs)
  - server.py (3 refs in startup self-heal block)

Verified: 6/6 tests in tests/test_metadata_save_wipe_1558.py pass
with the renamed file + updated references.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

race: _clear_stale_stream_state mutates outside per-session lock (Opus follow-up from v0.50.279)

2 participants