release: v0.50.249#1365
Conversation
- api/streaming.py SSE payload now falls back to `agent.model_metadata.get_model_context_length` when the compressor doesn't supply `context_length` (mirrors the session-save fallback shipped in v0.50.247).
- api/streaming.py also falls back to `s.last_prompt_tokens` to avoid using the cumulative `input_tokens` counter.
- static/ui.js tracks `rawPct` separately from `pct` and shows a '(context exceeded)' tooltip when `rawPct > 100` instead of the misleading '100% used (0% left)'.
- static/messages.js clears the 'Uploading...' composer status after upload completes.

Co-authored-by: nesquena-hermes <[email protected]>
Session.load_metadata_only().compact() was dropping is_cli_session, source_tag, session_source, and source_label, so imported CLI/gateway sessions lost their provenance in sidebar/API payloads. Adds these to METADATA_FIELDS and Session.compact(). Co-authored-by: Dennis Soong <[email protected]>
When a session's compression lineage spans multiple segments (linked via _lineage_root_id from api/agent_sessions.py), the sidebar previously rendered each segment as a separate top-level row. Adds _collapseSessionLineageForSidebar() that groups by lineage root and keeps only the most recently active tip per group, with a _lineage_collapsed_count marker for future UI affordances. Co-authored-by: Dennis Soong <[email protected]>
Adds a 'storage' event listener for the hermes-webui-session localStorage key. Idle tabs auto-load the new active session and re-render the sidebar; busy tabs show a toast and do not interrupt the active turn. Co-authored-by: Dennis Soong <[email protected]>
#1355) Replaces the 1.5s HTTP polling loop for clarify with a Server-Sent Events endpoint at /api/clarify/stream that pushes clarify events to the browser instantly. Mirrors the approval SSE pattern from v0.50.248 (#1350) including all the correctness lessons:
- Atomic subscribe + initial snapshot under clarify._lock
- _clarify_sse_notify called inside _lock for ordering guarantees (no notify-out-of-order race)
- Notify passes head=q[0].data (head-fidelity, not the just-appended entry)
- resolve_clarify also calls notify after pop so trailing clarifies surface immediately (no stuck-clarify bug)
- Empty-state notify with None,0 after pop-empty so frontend hides the card
- 30s keepalive comments, _CLIENT_DISCONNECT_ERRORS handling
- Bounded queue (maxsize=16) with silent drop on full
- Frontend: EventSource with automatic 3s HTTP polling fallback on onerror

Co-authored-by: fxd-jason <[email protected]>
Bundles 5 community PRs:
- #1355 feat(clarify): SSE long-connection (mirrors #1350 pattern, includes all correctness lessons)
- #1356 fix: context window indicator overflow (live SSE fallback) + uploading status clear
- #1357 fix: preserve imported session source metadata
- #1358 fix: collapse sidebar session lineage rows
- #1359 fix: sync active session across tabs

Tests: 3444 passing (3411 -> 3444, +33)
…ndler

The new _handle_clarify_sse_stream handler in #1355 holds clarify._lock and then calls clarify.get_pending(sid) under the lock. get_pending also acquires _lock internally — and clarify._lock is a non-reentrant threading.Lock(), so the second acquisition deadlocks the SSE handler thread the moment any client connects to /api/clarify/stream.

Existing tests pass because they only exercise sse_subscribe, sse_unsubscribe, _clarify_sse_notify, and submit_pending directly — none of them invoke the route handler. The deadlock would only manifest when a real EventSource opens the connection. Reproduced with a tiny harness that holds _lock and calls get_pending: the worker thread is still blocked after a 2s timeout. With the fix, both empty and populated queue cases complete in <1ms.

Fix: read clarify._gateway_queues / clarify._pending inline under the same _lock acquisition, mirroring the approval SSE handler's pattern at api/routes.py:2785-2793. No recursive lock; the head-of-queue snapshot is identical to what get_pending would have returned.

Added tests/test_pr1355_sse_handler_no_deadlock.py with three tests:
- behavioural: empty queue snapshot completes within 2s
- behavioural: populated queue snapshot returns the head entry
- source-level invariant: routes.py must not call get_clarify_pending() inside the `with _clarify_lock:` block (locks the regression in)

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
nesquena left a comment
Review — end-to-end ⚠️ → ✅ (deadlock caught and fixed; pushed 604b44a)
Release v0.50.249 — bundle of 5 community PRs (plus #1356 already approved separately). The trace caught a production-crashing deadlock in the new clarify SSE handler (#1355). Fix pushed; release now safe to ship.
Squash audit
bbdacdc #1356 fix: context window indicator overflow (already approved on #1356)
70dac01 #1357 fix: preserve imported session source metadata @dso2ng
f13230f #1358 fix: collapse sidebar session lineage rows @dso2ng
6a73680 #1359 fix: sync active session across tabs @dso2ng
d2d464a #1355 feat(clarify): SSE long-connection @fxd-jason
36d87f5 release: v0.50.249 (CHANGELOG only)
604b44a fix(clarify-sse): inline snapshot under _lock to avoid deadlock ← my fix
All 5 contributor commits preserve Author: + Co-authored-by: trailers. ✅
What I caught — clarify SSE handler deadlocks the moment any client connects
PR #1355 ships a new _handle_clarify_sse_stream route handler at api/routes.py:2888-2935. The original code:
```python
with _clarify_lock:                              # acquire _lock
    _clarify_subs.setdefault(sid, []).append(q)
    initial_pending = get_clarify_pending(sid)   # ← calls clarify.get_pending
    initial_count = 1 if initial_pending else 0
```

while `get_pending` itself does:

```python
def get_pending(session_key: str) -> dict | None:
    with _lock:                                  # ← re-acquires the SAME lock
        queue = _gateway_queues.get(session_key) or []
        if queue:
            return dict(queue[0].data)
        ...
```

`clarify._lock = threading.Lock()` is non-reentrant (api/clarify.py:16). The same thread calling `acquire()` twice deadlocks indefinitely. The first EventSource connection to /api/clarify/stream would have hung the SSE handler thread forever.
The existing test suite missed this because all 29 of tests/test_clarify_sse.py's tests exercise sse_subscribe/sse_unsubscribe/_clarify_sse_notify/submit_pending directly — none of them invoke the route handler. CI was green because the deadlock only manifests when a real EventSource opens the connection.
Reproduced with a 30-line harness:

```
DEADLOCK CONFIRMED: thread is blocked after 2s
```

After the fix, both empty-queue and populated-queue snapshots complete in <1ms.
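The harness itself isn't in the diff, but the failure mode is trivial to reproduce in isolation. A minimal sketch with illustrative names (not the real api/clarify.py module):

```python
import threading

_lock = threading.Lock()  # non-reentrant, like clarify._lock

def get_pending():
    with _lock:  # second acquire by the SAME thread never succeeds
        return None

def handler_pre_fix(result):
    with _lock:  # first acquire, mirroring the original route handler
        result.append(get_pending())  # blocks here forever

result = []
t = threading.Thread(target=handler_pre_fix, args=(result,), daemon=True)
t.start()
t.join(timeout=2.0)
print("DEADLOCK CONFIRMED" if t.is_alive() else "completed")
```

The worker never appends a result; swapping `threading.Lock` for `threading.RLock` would also unblock it, but the shipped fix avoids the nested acquire entirely instead.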
What I pushed — 604b44a
- `api/routes.py` — replaced the `get_clarify_pending(sid)` call with inline reads of `_gateway_queues` and `_pending` under the same `_lock` acquisition. Mirrors the approval SSE handler's pattern at api/routes.py:2785-2793. Returns the same head-of-queue dict that `get_pending` would have returned.
- `tests/test_pr1355_sse_handler_no_deadlock.py` — 3 new tests:
  - `test_handler_snapshot_does_not_deadlock_when_queue_is_empty` — runs the handler's snapshot logic in a worker thread with a 2s timeout; asserts no deadlock and `(None, 0)`.
  - `test_handler_snapshot_does_not_deadlock_when_queue_has_entry` — primes the queue via `submit_pending`, runs the snapshot, asserts the head entry is captured.
  - `test_routes_handler_does_not_call_get_pending_under_lock` — source-level invariant: parses `_handle_clarify_sse_stream`, walks the `with _clarify_lock:` block by indentation, asserts `get_clarify_pending(` is NOT inside it. Locks the regression in.
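The source-level invariant's indentation walk can be sketched roughly like this (hypothetical helper; the real test parses the actual routes.py source):

```python
def calls_inside_lock_block(source: str, lock_stmt: str, forbidden: str) -> bool:
    """Return True if `forbidden` appears inside the `lock_stmt` with-block."""
    lines = source.splitlines()
    for i, line in enumerate(lines):
        if line.strip().startswith(lock_stmt):
            indent = len(line) - len(line.lstrip())
            for inner in lines[i + 1:]:
                if not inner.strip():
                    continue  # blank lines don't end the block
                if len(inner) - len(inner.lstrip()) <= indent:
                    break  # dedented: we've left the with-block
                if forbidden in inner:
                    return True
    return False

BAD = """\
with _clarify_lock:
    subs.append(q)
    initial = get_clarify_pending(sid)
"""

GOOD = """\
with _clarify_lock:
    subs.append(q)
    initial = queues.get(sid)
after = get_clarify_pending(sid)
"""

print(calls_inside_lock_block(BAD, "with _clarify_lock", "get_clarify_pending("))   # True
print(calls_inside_lock_block(GOOD, "with _clarify_lock", "get_clarify_pending("))  # False
```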
Traced against upstream hermes-agent
agent.model_metadata.get_model_context_length (used in #1356 fallback) and register_gateway_notify / unregister_gateway_notify (consumed by clarify SSE) verified at /tmp/hermes-agent-fresh/tools/approval.py:382-401. The new is_cli_session/source_tag/session_source/source_label fields from #1357 are webui-only — upstream uses source_label in unrelated contexts (skills hub, web_server credentials), not as Session attributes. ✅ Cross-tool clean.
End-to-end trace — clarify SSE (post-fix)
- Client connects to `/api/clarify/stream?session_id=…` (static/messages.js:1681).
- Auth gate via `check_auth(self, parsed)` in server.py:76 before `handle_get`.
- Route dispatch api/routes.py:1274 → `_handle_clarify_sse_stream`.
- Atomic subscribe + snapshot under `_clarify_lock`: append queue, read `_gateway_queues[sid][0].data` (or fall back to `_pending[sid]` if the list shape isn't there), release lock.
- SSE response headers + initial `_sse(handler, 'initial', …)`. `q.get(timeout=30)` loop with keepalive comments; `_CLIENT_DISCONNECT_ERRORS` + `finally:` unsubscribe on close.
- `submit_pending` at api/clarify.py:138 calls `_clarify_sse_notify` from inside `_lock` — head-of-queue `dict(gw_queue[0].data)`, count `len(gw_queue)`. Notify ordering preserved.
- `resolve_clarify` at api/clarify.py:170-175 — pops the entry, then if the queue has more, notifies the new head; if the queue is empty, notifies `(None, 0)` so the frontend hides the card. Trailing-clarify-stuck bug solved.
This pattern incorporates all three v0.50.248 #1350 lessons (notify under lock, head-of-queue fidelity, post-pop re-emit) — exactly as the PR description claims. The only gap was the recursive lock in the handler, which my fix closes.
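A toy model of the full pattern — atomic subscribe + snapshot, notify-under-lock with head fidelity, and post-pop re-emit — assuming simplified module-level state rather than the real api/clarify.py API:

```python
import queue
import threading

_lock = threading.Lock()
_pending = []   # head-of-queue = _pending[0]
_subs = []      # one bounded Queue per SSE subscriber

def _notify(head, count):
    for q in _subs:
        try:
            q.put_nowait((head, count))
        except queue.Full:
            pass  # silent drop: a slow client misses an event, nothing blocks

def subscribe():
    q = queue.Queue(maxsize=16)
    with _lock:  # atomic subscribe + initial snapshot (no race window)
        _subs.append(q)
        head = dict(_pending[0]) if _pending else None
        return q, (head, len(_pending))

def submit(entry):
    with _lock:
        _pending.append(entry)
        _notify(dict(_pending[0]), len(_pending))  # head fidelity: q[0], not entry

def resolve():
    with _lock:
        _pending.pop(0)
        if _pending:
            _notify(dict(_pending[0]), len(_pending))  # surface the next head
        else:
            _notify(None, 0)  # empty-state so the frontend hides the card
```

Note how a second `submit` still notifies the *first* entry as head, and how `resolve` on the last entry emits `(None, 0)` — these are exactly the head-fidelity and trailing-clarify behaviours the trace verifies.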
End-to-end trace — context indicator (#1356)
Already traced in PR #1356 review. Identical content; behavioural harness re-confirmed.
End-to-end trace — session metadata (#1357)
`is_cli_session`, `source_tag`, `session_source`, `source_label` added to:
- api/models.py:350-353 — `Session.__init__` reads them from kwargs (backwards-compatible defaults `False`/`None`).
- api/models.py:374 — `METADATA_FIELDS` round-trip (save → load_metadata_only → still present).
- api/models.py:470-473 — `compact()` returns them in sidebar/API payloads.
Population sites:
- api/routes.py:4448 sets `is_cli_session=True` for imported CLI sessions (pre-existing).
- The other three fields (`source_tag`, `session_source`, `source_label`) aren't actively populated yet by webui code — they're plumbing for future PRs. Not a regression, because previously they were never read or written either.
CLI: doesn't touch these fields. ✅
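The round-trip can be modeled with a stripped-down Session (field names match the PR; everything else here is a stand-in for the real api/models.py class):

```python
METADATA_FIELDS = ("is_cli_session", "source_tag", "session_source", "source_label")

class Session:
    def __init__(self, **kwargs):
        # backwards-compatible defaults: False / None when absent
        self.is_cli_session = kwargs.get("is_cli_session", False)
        self.source_tag = kwargs.get("source_tag")
        self.session_source = kwargs.get("session_source")
        self.source_label = kwargs.get("source_label")

    @classmethod
    def load_metadata_only(cls, saved):
        # only fields listed in METADATA_FIELDS survive a metadata-only load
        return cls(**{k: saved[k] for k in METADATA_FIELDS if k in saved})

    def compact(self):
        # sidebar/API payload must carry the provenance fields
        return {k: getattr(self, k) for k in METADATA_FIELDS}

saved = {"is_cli_session": True, "source_tag": "cli-import"}
print(Session.load_metadata_only(saved).compact())
```

The pre-fix bug was exactly the omission of these four keys from `METADATA_FIELDS` and `compact()` — the load/compact pipeline silently dropped them.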
End-to-end trace — lineage collapse (#1358)
Pure client-side at static/sessions.js:1219-1234. _collapseSessionLineageForSidebar groups by _lineage_root_id/lineage_root_id/parent_session_id, picks the latest tip per group by _sessionTimestampMs, attaches _lineage_collapsed_count for future UI affordances (currently set but not consumed — fine).
Edge cases:
- Empty input → `[]`. ✅
- Single session, no lineage → kept as-is. ✅
- Two sessions, same lineage → latest kept, count=2. ✅
- Mix of lineage and non-lineage sessions → non-lineage retained verbatim, lineage groups collapsed. ✅
- Iteration order: `Map` preserves insertion order, so non-lineage sessions appear before lineage groups. The caller's downstream sort should re-order by timestamp.
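For reference, the grouping logic translates to Python roughly as follows (the real implementation is the JS in static/sessions.js; the `ts` field and dict shapes here are illustrative):

```python
def collapse_lineage(sessions):
    groups = {}   # lineage root -> member sessions, insertion-ordered
    out = []
    for s in sessions:
        root = s.get("_lineage_root_id") or s.get("parent_session_id")
        if root is None:
            out.append(s)  # non-lineage rows pass through verbatim
        else:
            groups.setdefault(root, []).append(s)
    for members in groups.values():
        tip = max(members, key=lambda m: m["ts"])  # most recently active tip
        if len(members) > 1:
            tip = {**tip, "_lineage_collapsed_count": len(members)}
        out.append(tip)
    return out
```

As in the JS, non-lineage rows land before collapsed groups, so the caller still needs its timestamp sort afterwards.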
End-to-end trace — cross-tab sync (#1359)
static/sessions.js:1631-1645. The `storage` event listener fires when ANOTHER tab sets `localStorage['hermes-webui-session']`. The handler:
- Skips if not the right key, no value, or already on this session.
- Skips and toasts if `S.busy` — never interrupt an active turn. ✅
- Otherwise `await loadSession(sid)` + re-renders the sidebar.

Backed by the regression test `test_storage_sync_does_not_switch_while_busy`, which asserts the `if (S.busy)` guard exists.
Cross-tool consistency
- Clarify SSE: webui-only mechanism (`_clarify_sse_subscribers` is module-level state). Upstream `tools.approval.gateway_notify` already pushed clarify events via the chat/stream SSE; this is an additive secondary path. ✅
- Session metadata fields: webui-only Session attributes; CLI uses state.db, doesn't read these. ✅
- Lineage / cross-tab: JS only, frontend-internal. ✅
- Context indicator (#1356): already cleared in prior review. ✅
- No `config.yaml` writes. ✅
Race / lock analysis
- Clarify SSE handler (post-fix): subscribe + snapshot atomic under `_clarify_lock`. No recursive acquisition. No notify-ordering race (notify called inside `submit_pending`'s lock block). Trailing-clarify re-emit handles the pop-empty case.
- `_clarify_sse_notify` is called inside `_lock` at api/clarify.py:138 (`submit_pending`), api/clarify.py:172 and api/clarify.py:175 (`resolve_clarify`). It does `q.put_nowait(payload)` while holding `_lock`. The queue's internal mutex is briefly acquired; no inversion (no other path holds `queue.mutex` while waiting on `_clarify_lock`). Safe.
- `q.put_nowait` on a full queue: catches `queue.Full` and silently drops. No blocking under the lock. ✅
- Cross-tab `storage` listener: runs on the main JS thread, no shared mutable state outside `S`. The `S.busy` guard prevents mid-turn interruption.
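The "no blocking under the lock" claim rests on `put_nowait` raising `queue.Full` immediately on a saturated queue. A quick demonstration (maxsize shrunk from the real 16 for brevity):

```python
import queue

q = queue.Queue(maxsize=2)  # the real subscriber queues use maxsize=16
dropped = 0
for i in range(5):
    try:
        q.put_nowait(i)     # returns (or raises) immediately; never sleeps
    except queue.Full:
        dropped += 1        # silent drop for the saturated subscriber

print(q.qsize(), dropped)   # 2 3
```

A blocking `q.put(...)` here would sleep while the notifier holds `_clarify_lock` — exactly the hazard the non-blocking variant avoids.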
Security audit
- ✅ `session_id` from `parse_qs` — used as a dict key only, no path/SQL/shell.
- ✅ Auth gated via `check_auth` in server.py:76 for ALL GET routes.
- ✅ `_sse` payload via `json.dumps(data)` — no XSS.
- ✅ Frontend `JSON.parse(ev.data)` then property access — no innerHTML.
- ✅ Bounded queue (maxsize=16) prevents memory exhaustion from slow clients.
- ✅ Cross-tab listener gated on key match (`'hermes-webui-session'`) — no untrusted keys leak in.
- ✅ `loadSession(sid)` in the cross-tab handler — `sid` comes from localStorage, which is same-origin, and the API call already validates the session exists (404 returned otherwise).
- ✅ Session metadata fields are pure data attributes; no executable surface.
Edge-case matrix
| Scenario | Expected | Actual |
|---|---|---|
| First clarify SSE client connects | Subscribe + initial snapshot, no deadlock | DEADLOCK pre-fix |
| Empty `_gateway_queues` snapshot | `initial=None, count=0` | ✅ |
| Populated queue snapshot | head dict + count | ✅ |
| `submit_pending` fires after subscribe | New head pushed via SSE | ✅ |
| `resolve_clarify` clears the only pending | Empty `(None, 0)` event pushed → frontend hides | ✅ |
| `resolve_clarify` with more pending | New head event pushed → frontend updates | ✅ |
| Dedup case (same question/choices) | Reuses existing entry, no new SSE event | ✅ (initial snapshot still covers new subscribers) |
| Slow client (queue=16 saturated) | `queue.Full` → silent drop | ✅ |
| Client disconnect mid-stream | `_CLIENT_DISCONNECT_ERRORS` caught, `finally` unsubscribes | ✅ |
| EventSource fails on connect | `onerror` → `_startClarifyFallbackPoll` (3s polling) | ✅ |
| 60s health timer fires | Forces reconnect | ✅ |
| Session metadata round-trip (save → load) | All 4 new fields preserved | ✅ (test_imported_cli_session_metadata_survives_compact) |
| Lineage with 1 segment | Kept as-is, no count | ✅ |
| Lineage with N segments | Latest kept, `_lineage_collapsed_count=N` | ✅ |
| Cross-tab session change while idle | Auto-loads new session | ✅ |
| Cross-tab session change during busy turn | Toast shown, current turn not interrupted | ✅ |
| Older agent build (no `get_model_context_length`) for #1356 | try/except swallows ImportError | ✅ |
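The last row's graceful degradation amounts to a guarded attribute call. The notes mention an ImportError guard; this sketch models the same fall-back idea with AttributeError for a missing method (all class names here are hypothetical stand-ins):

```python
def context_length_or_none(agent):
    try:
        return agent.model_metadata.get_model_context_length()
    except AttributeError:  # older build: the method simply isn't there
        return None

class OldMetadata:  # pre-upgrade agent metadata, no context-length API
    pass

class NewMetadata:
    def get_model_context_length(self):
        return 200_000  # arbitrary example value

class Agent:
    def __init__(self, metadata):
        self.model_metadata = metadata

print(context_length_or_none(Agent(NewMetadata())))  # 200000
print(context_length_or_none(Agent(OldMetadata())))  # None
```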
Tests
- New regression tests I added: 3/3 pass in `tests/test_pr1355_sse_handler_no_deadlock.py`.
- PR #1355's own tests: 29/29 pass in `tests/test_clarify_sse.py`.
- Other PRs: `test_session_lineage_collapse.py` 1/1, `test_session_cross_tab_sync.py` 2/2, `test_gateway_sync.py::test_imported_cli_session_metadata_survives_compact` pass.
- Full suite: 3395 passed, 54 skipped, 3 xpassed, 0 failed in 16.01s on `604b44a`.
- CI: was green on `36d87f5` for 3.11/3.12/3.13. Will need a re-run on `604b44a` for the new commit.
Minor observations (non-blocking)
- `_lineage_collapsed_count` is set but not read anywhere. Future UI affordance — fine, just dead state for now.
- Three of the four new metadata fields aren't populated by any current code path (`source_tag`, `session_source`, `source_label`). PR #1357 only adds the round-trip plumbing — population is presumably a follow-up. Not a regression, because they were never populated before either.
- PR description claims "3444 passed"; my local count is 3395 (3392 pre-fix baseline + 3 new tests I added). The discrepancy is consistent with prior releases — likely subtest counting drift.
- Clarify dedup case at api/clarify.py:115-131 does NOT call `_clarify_sse_notify`. This is correct: the head-of-queue didn't change, so no event is needed for already-subscribed clients. New subscribers still get the head via the initial snapshot. ✅
- 60s health-timer reconnect in the clarify frontend at static/messages.js:1715-1722 — unconditionally tears down and re-subscribes every 60s, even if the connection is healthy. This is more aggressive than approval SSE's 5s `S.busy` health check. Probably fine (60s churn is light), but could be refined to only reconnect if no event was received in the window. Out of scope.
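The suggested refinement — reconnect only on a genuine event gap — is a few lines of state. A Python sketch of the logic (the real code would live in static/messages.js; an injectable clock stands in for wall time):

```python
import time

class StaleDetector:
    """Reconnect only when no event has arrived within the window."""

    def __init__(self, window_s=60.0, now=time.monotonic):
        self.window_s = window_s
        self.now = now              # injectable clock for testing
        self.last_event_at = self.now()

    def on_event(self):
        self.last_event_at = self.now()  # any initial/clarify event refreshes

    def should_reconnect(self):
        return self.now() - self.last_event_at > self.window_s

fake = [0.0]  # simulated monotonic clock
det = StaleDetector(window_s=60.0, now=lambda: fake[0])
fake[0] = 30.0
det.on_event()                  # event arrives at t=30
fake[0] = 80.0
print(det.should_reconnect())   # False: only 50s since last event
fake[0] = 95.0
print(det.should_reconnect())   # True: 65s gap exceeds the window
```

Under healthy traffic the timer never forces a teardown, which is exactly the behaviour the follow-up PR later shipped.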
Recommendation
✅ Approved after fix. The deadlock was a critical regression that would have manifested the first time any tab opened the clarify SSE stream — caught it via static-trace inspection of the _lock acquisition graph (the source-level test test_routes_handler_does_not_call_get_pending_under_lock locks this in). Fix mirrors the approval SSE handler's correct pattern. All other PRs in the bundle (#1356/1357/1358/1359) traced clean.
Parked at approval after 604b44a — ready for the release agent's merge/tag pipeline once CI re-runs green on the new commit.
Opus pre-release review

Verdict: DO NOT SHIP as originally pushed → Ship after fixes (now applied). Opus caught one critical deadlock + one wasteful timer pattern. Nathan independently caught the same deadlock in parallel work at

🟥 MUST-FIX — Deadlock in
Reset to approved commit
Shipped — v0.50.249 ✅

Merge: What shipped

Plus pre-release fix

Out-of-scope follow-up

Per your guidance, I'm bringing back the stale-detector fix (

Tests

Hygiene complete

Held PRs (NOT in this release, still on hold)

Lesson captured: integration-style behavioral tests with wall-clock timeouts are mandatory for any new long-lived handler that crosses module boundaries — your
…al reconnect (nesquena#1367) Follow-up to v0.50.249 / PR nesquena#1365 absorbing Opus SHOULD-FIX nesquena#2. Originally reset out of nesquena#1365 because the reviewer flagged it as out-of-scope; brought back per follow-up guidance that correctness-improving changes should ship even when out of scope. The clarify SSE health timer at static/messages.js:1715 was an unconditional 60s force-reconnect, not the 'no event in 60s' detector its comment claimed. Now actually a stale-detector that tracks lastEventAt on initial+clarify event arrivals; only reconnects when the gap exceeds 60s. Under healthy conditions the timer never fires. Co-authored-by: nesquena-hermes <[email protected]>
Release v0.50.249
Bundle of five community PRs. #1356 was already independently approved before batching.
Constituent PRs
Pre-release gate
- `Author:` + `Co-authored-by:` trailers
- `f02df28` with end-to-end trace, race/lock analysis, edge-case matrix, and security audit (`proc_03b1d22b0743`)

Why this release is lower-risk than v0.50.248
The most substantial PR (#1355 — 400 LOC SSE feature) already incorporates all three Opus MUST-FIX lessons from #1350 / v0.50.248:
- Notify under `_lock` in both `submit_pending` and `resolve_clarify` — no notify-ordering race
- `q[0].data` head-of-queue, not the just-appended entry — head-fidelity
- `resolve_clarify` re-emits the new head after pop (with `None`/`0` empty-state when the queue is empty) — no trailing-clarify-stuck bug
- Subscribe + snapshot in one `with _lock:` block — no snapshot/subscribe race

This is exactly the pattern @fxd-jason picked up from the v0.50.248 #1350 release-comment debrief. Excellent absorption of feedback.
Plan after approval
- (`--merge`, no squash — preserves all 6 contributor commits)
- `v0.50.249` on the merge commit
- `/tmp/wt-v0.50.249`

Held PRs (NOT in this release)
Compatibility notes
- The `/api/clarify/stream` SSE endpoint is additive — the `/api/clarify/pending` polling endpoint stays in place as the frontend's automatic fallback.
- `Session.__init__` now accepts 4 new metadata kwargs (`is_cli_session`, `source_tag`, `session_source`, `source_label`) — backwards-compatible; existing sessions without those fields default to `None`/`False`.
- The `storage`-event listener for cross-tab sync is gated on `S.busy` so active turns are never interrupted.