
feat(clarify): SSE long-connection for real-time clarify notifications#1355

Closed
jasonjcwu wants to merge 1 commit into nesquena:master from jasonjcwu:feat/clarify-sse

Conversation

@jasonjcwu
Contributor

Replaces the 1.5s HTTP polling loop for clarify with a Server-Sent Events endpoint at `/api/clarify/stream` that pushes clarify events to the browser instantly.

What changed

  • Backend: SSE subscribe/unsubscribe/notify lifecycle in `api/clarify.py`
    • `sse_subscribe(session_id)` — register a bounded Queue for a session
    • `sse_unsubscribe(session_id, queue)` — cleanup on disconnect
    • `_clarify_sse_notify(session_id, head, total)` — push event to all subscribers
    • `submit_pending()` calls `_clarify_sse_notify()` inside `_lock` for ordering guarantees
    • `resolve_clarify()` calls `_clarify_sse_notify()` after pop (surfaces trailing clarifies)
  • New route: `GET /api/clarify/stream?session_id=`
    • Atomic subscribe + initial snapshot under `clarify._lock`
    • 30s keepalive comments, `_CLIENT_DISCONNECT_ERRORS` handling
  • Frontend: `startClarifyPolling()` in `static/messages.js`
    • EventSource SSE as primary transport
    • Automatic fallback to 3s HTTP polling on SSE error
    • Health timer detects stale connections (60s reconnect)
  • 29 new tests (static analysis + unit + concurrency)

Follows the same SSE pattern as the approval stream (#1350), including the notify-ordering and head-fidelity fixes from `e68f74a`.

Testing

All 75 SSE-related tests pass (42 approval + 29 clarify + 4 regression):

  • `tests/test_approval_sse.py` — 42 passed
  • `tests/test_clarify_sse.py` — 29 passed
  • `tests/test_pr1350_sse_notify_correctness.py` — 4 passed

@nesquena-hermes
Collaborator

Thanks @jasonjcwu, this is a clean port of the SSE pattern from #1350 to clarify, and it's exactly where I'd want this to land next. A few review notes:

Design — looks right

  • Subscribe-under-lock + initial snapshot under clarify._lock is the same correctness pattern from e68f74a. Good.
  • _clarify_sse_notify() called inside _lock in submit_pending() and after pop in resolve_clarify() mirrors the approval ordering guarantees. The trailing-clarify case is the easy one to miss; nice that it's covered.
  • 30s keepalive + 3s HTTP polling fallback matches what we already shipped. Behaviourally identical from the user's POV when SSE is healthy, gracefully degrades when it isn't.

Questions / nits

  1. Bounded Queue — what's the size, and what's the behaviour when full? For approval streams this was a non-issue because notifications are infrequent, but clarify can fire bursts (multi-tool plans that ask in rapid succession). If the queue drops silently, the head/total recompute on next event will still self-heal, but worth confirming the test suite exercises a fast-burst case.
  2. The frontend function is still called startClarifyPolling() — fine for the diff but consider renaming in a follow-up to startClarifyStream() so future readers don't grep for "polling" and miss it.
  3. Session-switching: if the user switches between two active sessions quickly, do we close the old EventSource before opening a new one? A leak here would manifest only under heavy multi-tab use, but is worth a one-line check.
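The drop-on-full behaviour asked about in question 1 can be illustrated with a tiny burst simulation (the helper name and queue size here are illustrative, not from the diff):

```python
import queue

def notify_one(q: queue.Queue, head, total: int) -> bool:
    """Best-effort push; returns False when the bounded queue is full."""
    try:
        q.put_nowait({"pending": head, "pending_count": total})
        return True
    except queue.Full:
        return False

# Simulate a fast burst of clarify events against a small bounded queue.
q = queue.Queue(maxsize=2)
delivered = [notify_one(q, {"id": i}, i + 1) for i in range(5)]
# Tail events are silently dropped, but because every event carries a
# complete head/total snapshot, whatever the client drains next is
# self-consistent -- the subsequent notify self-heals the missed state.
```

This is why silent drop is tolerable here: events are snapshots, not deltas, so a dropped event never leaves the client permanently out of sync.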

Tests
75 SSE tests passing across the two streams + regression suite is the right bar. The tests/test_pr1350_sse_notify_correctness.py regression coverage carrying forward is exactly what I want to see — proves the ordering fix isn't being undone.

Once the queue-full behaviour is documented in a comment or test, this is good to merge from my side. Will do a closer line-level pass this week.

@nesquena-hermes nesquena-hermes mentioned this pull request Apr 30, 2026
nesquena added a commit that referenced this pull request Apr 30, 2026
…ndler

The new _handle_clarify_sse_stream handler in #1355 holds clarify._lock and
then calls clarify.get_pending(sid) under the lock. get_pending also acquires
_lock internally — and clarify._lock is a non-reentrant threading.Lock(),
so the second acquisition deadlocks the SSE handler thread the moment any
client connects to /api/clarify/stream.

Existing tests pass because they only exercise sse_subscribe, sse_unsubscribe,
_clarify_sse_notify, and submit_pending directly — none of them invoke the
route handler. The deadlock would only manifest when a real EventSource opens
the connection.

Reproduced with a tiny harness that holds _lock and calls get_pending: the
worker thread is still blocked after a 2s timeout. With the fix, both empty
and populated queue cases complete in <1ms.

Fix: read clarify._gateway_queues / clarify._pending inline under the same
_lock acquisition, mirroring the approval SSE handler's pattern at
api/routes.py:2785-2793. No recursive lock; head-of-queue snapshot is
identical to what get_pending would have returned.

Added tests/test_pr1355_sse_handler_no_deadlock.py with three tests:
- behavioural: empty queue snapshot completes within 2s
- behavioural: populated queue snapshot returns the head entry
- source-level invariant: routes.py must not call get_clarify_pending()
  inside `with _clarify_lock:` block (locks the regression in)

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
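The deadlock and its fix can be reproduced with a small harness along the lines described above. This is an illustrative repro, not the actual `api/clarify.py` code; the lock is passed as a parameter here only so each scenario gets a fresh lock:

```python
import threading

_pending = {"sid": [{"prompt": "?"}]}

def get_pending(lock, sid):
    with lock:                          # acquires the lock internally,
        q = _pending.get(sid, [])       # like clarify.get_pending does
        return (q[0], len(q)) if q else (None, 0)

def buggy_snapshot(lock, sid):
    with lock:                          # caller already holds the lock...
        return get_pending(lock, sid)   # ...second acquire never returns

def fixed_snapshot(lock, sid):
    with lock:                          # one acquisition; read state inline
        q = _pending.get(sid, [])
        return (q[0], len(q)) if q else (None, 0)

def still_blocked(fn, timeout=0.5) -> bool:
    """Run fn in a worker thread; True if it is still stuck after timeout."""
    t = threading.Thread(target=fn, daemon=True)
    t.start()
    t.join(timeout)
    return t.is_alive()
```

Because `threading.Lock()` is non-reentrant, the buggy variant blocks forever on the nested acquire; `threading.RLock()` would not deadlock, but the project uses a plain `Lock`, so the inline-snapshot pattern is the correct fix.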
nesquena-hermes added a commit that referenced this pull request Apr 30, 2026
…al reconnect

The original PR #1355 frontend health timer at 60s force-reconnected every
60s regardless of activity, with a comment claiming it was a 'no event in
60s' detector. This churned one TCP/SSE setup per minute per active
session.

Fix: track lastEventAt on initial+clarify event arrivals; only reconnect
when the gap exceeds 60s. The server sends keepalive comments every 30s,
but EventSource silently consumes those — we can only bump lastEventAt on
real application events. Under healthy conditions the timer never fires.

Sits on top of nesquena's 604b44a deadlock fix (parallel work — Nathan
caught the same Opus MUST-FIX before my push landed). Opus SHOULD-FIX #2
from the v0.50.249 pre-release review.
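The fixed timer logic amounts to a gap check rather than an unconditional reconnect. The real code is JavaScript in `static/messages.js`; this is a behaviour sketch in Python with an injected clock for testability:

```python
import time

STALE_AFTER = 60.0  # seconds without a real application event

class HealthTimer:
    """Reconnect only when the gap since the last real event exceeds
    the threshold -- never unconditionally on every tick."""

    def __init__(self, now=time.monotonic):
        self._now = now
        self.last_event_at = now()

    def on_event(self) -> None:
        # Called on 'initial' and 'clarify' event arrivals. Keepalive
        # comments are consumed by EventSource before the app layer,
        # so they can never bump this timestamp.
        self.last_event_at = self._now()

    def should_reconnect(self) -> bool:
        return self._now() - self.last_event_at > STALE_AFTER
```

Under healthy conditions with regular events the check never fires, eliminating the once-per-minute TCP/SSE churn the original timer caused.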
nesquena-hermes pushed a commit that referenced this pull request Apr 30, 2026
#1355)

Replaces the 1.5s HTTP polling loop for clarify with a Server-Sent Events endpoint at /api/clarify/stream that pushes clarify events to the browser instantly. Mirrors the approval SSE pattern from v0.50.248 (#1350) including all the correctness lessons:

- Atomic subscribe + initial snapshot under clarify._lock
- _clarify_sse_notify called inside _lock for ordering guarantees (no notify-out-of-order race)
- Notify passes head=q[0].data (head-fidelity, not the just-appended entry)
- resolve_clarify also calls notify after pop so trailing clarifies surface immediately (no stuck-clarify bug)
- Empty-state notify with None,0 after pop-empty so frontend hides the card
- 30s keepalive comments, _CLIENT_DISCONNECT_ERRORS handling
- Bounded queue (maxsize=16) with silent drop on full
- Frontend: EventSource with automatic 3s HTTP polling fallback on onerror

Co-authored-by: fxd-jason <[email protected]>
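The resolve-then-re-emit behaviour in the list above (head-fidelity plus the empty-state notify) can be sketched in isolation. Names are illustrative; the actual code lives in `resolve_clarify()`:

```python
from collections import deque

def resolve_and_notify(pending: deque, notify) -> None:
    """Pop the resolved head, then re-emit so the UI never goes stale:
    either the new head surfaces immediately, or (None, 0) hides the card."""
    pending.popleft()
    if pending:
        notify(pending[0], len(pending))  # head-fidelity: q[0], not the tail
    else:
        notify(None, 0)                   # empty-state: frontend hides card
```

Skipping the re-emit after pop is exactly the stuck-clarify bug: the second of two queued clarifies would sit invisible until something else triggered a notify.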
nesquena-hermes added a commit that referenced this pull request Apr 30, 2026
Bundles 5 community PRs:
- #1355 feat(clarify): SSE long-connection (mirrors #1350 pattern, includes all correctness lessons)
- #1356 fix: context window indicator overflow (live SSE fallback) + uploading status clear
- #1357 fix: preserve imported session source metadata
- #1358 fix: collapse sidebar session lineage rows
- #1359 fix: sync active session across tabs

Tests: 3444 passing (3411 -> 3444, +33)
@nesquena-hermes
Collaborator

Shipped in v0.50.249 (merge d72399ae, tag https://github.com/nesquena/hermes-webui/releases/tag/v0.50.249) via batch release PR #1365.

Production verified live at port 8787, ?v=v0.50.249 cache-bust active.

The Clarify SSE feature you implemented landed cleanly with one MUST-FIX caught in pre-release review: the handler called clarify.get_pending(sid) while holding clarify._lock, but get_pending() itself does with _lock: — and _lock = threading.Lock() is non-reentrant. Every SSE connection deadlocked the server thread indefinitely.

This was caught by Opus + Nathan in parallel work (Nathan pushed the inline-snapshot fix at 604b44a). The behavioural test pattern at tests/test_pr1355_sse_handler_no_deadlock.py will catch this class of bug for any future SSE handler.

The rest of your implementation absorbed all 3 lessons from #1350 / v0.50.248 perfectly:

  • Notify called inside _lock in both submit_pending and resolve_clarify
  • Payload uses q[0].data (head, not tail)
  • resolve_clarify re-emits new head after pop, and pushes (None, 0) on empty so the UI hides the card

Excellent contribution. Already verified live: curl /api/clarify/stream?session_id=… returns event: initial\ndata: {"pending": null, "pending_count": 0}\n\n in under 2 seconds.
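That wire format is plain Server-Sent Events framing: an `event:` line, a `data:` line, and a blank-line terminator. A small helper reproduces the exact frame quoted above (illustrative, not the server's actual serializer):

```python
import json

def sse_frame(event: str, payload: dict) -> str:
    """Serialize one Server-Sent Events frame: event-name line,
    JSON data line, blank-line terminator."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"
```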

Thanks @fxd-jason! 🙏

GeoffBao pushed a commit to GeoffBao/hermes-webui that referenced this pull request May 1, 2026
GeoffBao pushed a commit to GeoffBao/hermes-webui that referenced this pull request May 1, 2026
GeoffBao pushed a commit to GeoffBao/hermes-webui that referenced this pull request May 1, 2026
