fix(#1298): preserve user message on cancel + persist activity-panel expand state#1338
fix(#1298): preserve user message on cancel + persist activity-panel expand state#1338nesquena-hermes wants to merge 1 commit intomasterfrom
Conversation
…expand Two distinct bugs reported by @YanTianlong-01 in #1298 against v0.50.240. ## Issue 2: Stop/Cancel deletes the user's typed message (data loss) When a user clicked Stop while the agent was streaming, `cancel_stream()` in `api/streaming.py` cleared `s.pending_user_message` before the streaming thread had merged the user turn into `s.messages`. The session was saved with neither the pending field nor a corresponding message — the user's typed text was permanently lost from the session JSON, not just the in-memory client copy. Reload didn't recover it. The fix runs inside the existing post-cancel session lock and synthesizes a user turn into `s.messages` from `pending_user_message` when the latest user message isn't already that turn. Content-match guards against double-append when the streaming thread won the race and merged before cancel got the lock. Pending attachments are preserved. ## Issue 1: Activity panel auto-collapses while user is reading it `ensureActivityGroup()` (`static/ui.js`) creates the live activity group with `tool-call-group-collapsed` whenever it's missing, and `finalizeThinkingCard()` force-adds that class on every tool boundary. The user's manual expand state lives only on the DOM class list, so every thinking → tool → thinking transition (which destroys and recreates the group) wipes the expand and snaps the panel shut. Adds a per-turn singleton `_liveActivityUserExpanded` that tracks the user's last explicit toggle (set by an inline `_onLiveActivityToggle()` hook in the summary button's onclick). `ensureActivityGroup()` consults it when re-creating the live group; `finalizeThinkingCard()` skips the force-collapse when the user has explicitly expanded. `clearLiveToolCards()` resets the tracker between turns so the next turn starts at the default collapsed state. ## Tests - `tests/test_issue1298_cancel_and_activity.py` — 9 tests: - 4 server-side: cancel synthesises user message, doesn't double-append, preserves attachments, no-ops when there's no pending - 5 client-side: source-level guards on `_liveActivityUserExpanded`, `ensureActivityGroup`, `finalizeThinkingCard`, the inline onclick, and `clearLiveToolCards` All adjacent suites still pass (test_cancel_interrupt, test_session_sidecar_repair, test_streaming_race_fix, test_regressions). Closes #1298 Co-authored-by: YanTianlong-01 <[email protected]>
nesquena
left a comment
There was a problem hiding this comment.
Review — end-to-end ✅ (clean approve, branch needs rebase before merge)
Heads up — branch is stale
The PR branched off b24b033 (pre v0.50.245). Current master is at 52e1567, which includes the v0.50.245 batch (PR #1334). I attempted a rebase locally:
api/streaming.pyandstatic/ui.jsauto-merge cleanly (the PR's regions don't overlap with #1316/#1322/#1324/#1326's regions).CHANGELOG.mdhas a textual conflict: the PR adds two entries under## [Unreleased]while master now has## [v0.50.245] — 2026-04-30populated. Resolution is mechanical — keep both sections, with the PR's two entries under[Unreleased]and v0.50.245 unchanged below.
Tests pass identically against the un-rebased branch (3247 + 9 new = full suite count for this branch's lineage; rebased onto v0.50.245 the count would be 3266). Reviewer/release agent should rebase before merge.
What this ships
Closes #1298 — two distinct UI/data bugs reported by @YanTianlong-01 against v0.50.240:
- Issue 2 (data loss): Stop/Cancel during streaming wiped the user's typed message from the session JSON.
- Issue 1 (UX): The Activity panel auto-collapsed every time a new tool/thinking event arrived, even after the user had manually expanded it.
Files touched (commit d90a605):
api/streaming.py— +49 lines insidecancel_stream()'s post-cancel session lock (synthesizes a user turn frompending_user_messagebefore the existing field-clear)static/ui.js— +37 / -5 lines:_liveActivityUserExpandedtracker + helpers, restore-on-create, respect-on-finalize, reset-on-cleartests/test_issue1298_cancel_and_activity.py— 326-line new file, 9 testsCHANGELOG.md— Unreleased entries (will conflict on rebase, see above)
Issue verification
- The reporter described both symptoms accurately. The triage comment on #1298 (left by
nesquena-hermes) initially proposed Issue 2 as a client-side rollback bug instatic/messages.jscancel handler. The PR diagnosis is stronger and correct: the data is actually lost from the session JSON server-side, not just from the in-memory client copy. I verified by re-readingcancel_stream()in current master —_cs.pending_user_message = Noneran without first persisting the pending text to_cs.messages, then_cs.save()wrote the cleared state to disk. Page reload after cancel would see neither field set. The PR's server-side fix solves the root cause; a client-side fix would only mask the symptom on the same page load and lose the data on refresh.
End-to-end trace — Issue 2 (cancel data loss)
- Pending write at
/api/chat/start→ api/routes.py:3251-3258: under_get_session_agent_lock(session_id), setss.pending_user_message = msg,s.pending_attachments = attachments, callss.save(). Pending message is durably on disk. - Streaming thread picks up the session, eventually reaches the success path under
_agent_lock(api/streaming.py:1957 → :2152-2155):- Runs
s.messages = _merge_display_messages_after_agent_result(...)(which incorporates the agent'sresult['messages']— the agent appendsuser_msgto its internal list at the start ofrun_conversation, see run_agent.py:10286-10287). - Sets
s.active_stream_id = None,s.pending_user_message = None, thens.save().
- Runs
- Cancel race:
cancel_stream()(api/streaming.py:2471+):- Pops
STREAMS/CANCEL_FLAGS/AGENT_INSTANCESunderSTREAMS_LOCK, callsagent.interrupt(). - Acquires
_get_session_agent_lock(_cancel_session_id)— the same Lock the streaming thread uses. - PR's new block at api/streaming.py:2576-2627: while holding the lock, before clearing pending fields, looks at
_cs.messages[-1]. If the trailing user message doesn't already matchpending_user_message(usinginsubstring match in either direction to tolerate the streaming thread's[Workspace: ...]prefix), synthesizes a user turn with timestamp + attachments and appends. - Clears pending fields, saves session, releases lock.
- Pops
Race outcome by timing:
| Cancel timing | Pre-fix behaviour | Post-fix behaviour |
|---|---|---|
| Beats streaming thread's merge (typical — agent thread blocked in API) | s.messages empty + pending cleared → user message lost from disk |
Recovery synthesizes turn → user message persists |
| Streaming thread merged first | Streaming thread already wrote user turn into s.messages → fine |
Cancel sees pending_user_message is None (streaming thread cleared it under same lock) → recovery short-circuits ✓ |
| Streaming thread crashes/exception before merge | _last_resort_sync_from_core (api/streaming.py:1313-1338) handles via core-transcript sync OR appends recovered turn + error marker |
Same as before — independent safety net still applies |
The recovery is a defensive layer, primarily load-bearing in the typical case (cancel beats merge). Even when the streaming thread eventually runs its merge after cancel released the lock, the merge incorporates the user turn from the agent's result anyway, so the final state is consistent.
End-to-end trace — Issue 1 (activity panel auto-collapse)
Existing behaviour:
ensureActivityGroup()(static/ui.js:2538-2575) creates the group withtool-call-group-collapsedwhenever it's missing, onappendThinking()calls.finalizeThinkingCard()(static/ui.js:4177+) force-addstool-call-group-collapsedon every tool boundary.removeThinking()(static/ui.js:4236-4239) removes the empty live group from the DOM during thinking → tool transitions, so the next thinking event re-entersensureActivityGroupand re-creates collapsed.
PR's fix:
- New module-scope tracker
_liveActivityUserExpanded(undefined/true/false) at static/ui.js:2545. _onLiveActivityToggle(group)reads!group.classList.contains('tool-call-group-collapsed')after class toggle. Only fires fordata-live-tool-call-group="1"(skips programmatic toggles on settled groups)._clearLiveActivityUserIntent()resets toundefined.ensureActivityGroup()re-create branch consults the tracker:Strictif(live && _liveActivityUserExpanded === true) collapsed=false; else if(live && _liveActivityUserExpanded === false) collapsed=true;
=== true/=== falsecorrectly distinguishes "user expanded", "user collapsed", and "user hasn't touched it" (undefined → fall through to default).- Inline
onclicktemplate now invokes_onLiveActivityToggle(g)after the class-toggle line. finalizeThinkingCard()skips force-collapse when_liveActivityUserExpanded === true.clearLiveToolCards()resets the tracker between turns.
Cross-tool consistency
Issue 2 affects only WebUI session state — pending_user_message is a WebUI-only field that the CLI doesn't read or write. Verified: searching /tmp/hermes-agent-fresh/ for pending_user_message returns no hits. So this fix has no cross-tool surface.
Race / lock analysis
cancel_stream()already acquires_get_session_agent_lock(_cancel_session_id)at api/streaming.py:2580 before the existing field-clear. The PR's new recovery code lives inside that same lock acquisition, so it inherits the existing lock ordering (STREAMS_LOCK → release → session lock).- The streaming thread takes locks in the same order at api/streaming.py:1957. No deadlock risk.
- Recovery and field-clear happen atomically from the streaming thread's perspective. Either the streaming thread's merge runs first (cancel sees pending=None, recovery short-circuits) or cancel runs first (recovery synthesizes, streaming thread's later merge overwrites with its own version that also has the user turn from the agent's result).
- The recovery block is wrapped in its own
try/exceptso non-list_cs.messagesshapes (e.g. unit-test Mock sessions, malformed sidecars) can't propagate exceptions and skip the rest of the cleanup. Errors are debug-logged.
Activity-panel expand state-machine harness
user clicks summary _onLiveActivityToggle reads
─────────────────────────────► classList → !contains('collapsed')
classList.toggle('collapsed') → tracker=true (expanded) or false (collapsed)
removeThinking() destroys group
─────────────────────────────► tracker preserved (singleton, not on DOM)
appendThinking() re-enters
ensureActivityGroup() → consults tracker:
true → collapsed=false (restore)
false → collapsed=true (respect collapse)
undef → opts.collapsed default
finalizeThinkingCard() → if tracker===true: SKIP force-collapse
else: force-add 'collapsed'
next turn: clearLiveToolCards() → _clearLiveActivityUserIntent() → undefined
Verified by grep that all five guards exist in static/ui.js. ✅
Edge-case matrix
| # | Scenario | Expected | Actual |
|---|---|---|---|
| 1 | Cancel before agent runs (pre-flight) | Recovery synthesizes user turn | ✅ test 1 |
| 2 | Cancel after streaming thread merged | No double-append | ✅ test 2 + lock atomicity (streaming thread cleared pending under same lock) |
| 3 | Cancel with attachments | Recovered turn preserves filenames | ✅ test 3 |
| 4 | Cancel with no pending (post-result) | No phantom user turn | ✅ test 4 |
| 5 | Mock session in unit tests | Cleanup completes, no cascade failure | ✅ try/except wrap |
| 6 | User expands → tool boundary | Stays expanded | ✅ _liveActivityUserExpanded === true guard |
| 7 | User collapses → next event | Stays collapsed | ✅ tracker === false branch |
| 8 | Default first-turn state | Starts collapsed | ✅ tracker undefined → opts.collapsed |
| 9 | Turn finalize → next turn | Resets to default | ✅ clearLiveToolCards calls _clearLiveActivityUserIntent |
| 10 | Cross-tool (CLI reads same session) | No change | ✅ pending_user_message is WebUI-only |
Other audit — things that are correct
- ✅ No XSS surface — recovered user turn synthesizes a dict with the verbatim
pending_user_message(the exact text the user typed), no template rendering. Inline onclick template change just addsif(typeof _onLiveActivityToggle==='function')_onLiveActivityToggle(g);— no user input flows through. - ✅ No new endpoints, file paths, or auth gates touched.
- ✅ No
eval/exec/Function()in the diff. - ✅ No secrets/tokens in the diff.
- ✅ Timestamp on recovered turn uses
int(time.time()), consistent with the rest of the codebase's user-message timestamps. - ✅ Attachments preserved correctly — copies
pending_attachmentsonly when it's a list/tuple, defending against malformed sidecars.
Tests
tests/test_issue1298_cancel_and_activity.py— 9/9 pass- 4 server-side tests exercise real
cancel_stream()+ Session save/load round-trips with isolated SESSION_DIR — not just static analysis. Verified the data loss at the disk level. - 5 client-side tests are source-level guards on
static/ui.js(regex/grep checks). Not behavioural — but the changed surface is small enough that the static guards catch the relevant invariants (tracker exists,=== trueguard in finalize, inline onclick references the helper, clear function called from clearLiveToolCards). - Full suite (this branch's lineage): 3247 passed, 54 skipped, 3 xpassed, 0 failed in 19.53s
- This branch has 9 new tests but lacks the +19 tests from PR #1334 (already merged into
master). Rebased onto current master, expect ~3266 passing.
- This branch has 9 new tests but lacks the +19 tests from PR #1334 (already merged into
Minor observations (non-blocking)
- The branch needs a rebase before merge — see top of review. CHANGELOG.md is the only conflict; code regions auto-merge. Rebase belongs to the release agent or maintainer; I haven't pushed.
- Recovery is overwritten in the typical case — when cancel beats merge, recovery runs and saves; when streaming thread later runs its merge, it overwrites
s.messageswith its own list (which also contains the user turn via the agent'sresult['messages']). So the recovery is mostly load-bearing for the narrow case where the streaming thread can't merge (exception path, agent thread death, server restart). The typical-cancel case ends up with a single user turn either way. No bug — just a clarification of where the safety net actually fires. - Multimodal cancel race (very narrow): if a user attaches images and cancel races AFTER the streaming thread's merge ran (rare — streaming thread holds
_agent_lockduring merge, cancel waits), the merged user turn has list-shapecontent(multimodal parts). The recovery's content-match checkisinstance(_last_content, str)would skip the equality test and return_already_persisted = False. However, by then the streaming thread has clearedpending_user_messageunder the same lock, so_pending_user is Noneshort-circuits the entire recovery before the content check runs. Verified safe — flagging in case future refactors decouple "merge" from "clear pending" within the lock. - Triage comment proposed a different fix shape for Issue 2 (client-side
static/messages.jscancel handler). The PR's server-side fix is the better choice — the data was actually missing from disk, not just fromS.messages. The triage comment's second proposed shape (server-side_apply_core_sync_or_error_markerextension) overlaps with the PR's recovery; both end up persisting the recovered turn, just at different points in the cancel flow. The PR's choice (inline incancel_stream) is more localized. - 5 client-side tests are static-only, but the changed JS surface is small enough that the regex guards lock the relevant invariants. A behavioral JSDOM harness could be added in a follow-up but isn't load-bearing here.
Recommendation
✅ Approved. End-to-end trace verified the bug claim (cancel race → pending cleared without persistence → user message lost from session JSON), fix is sound under the existing per-session lock, no double-append risk, no cross-tool surface (pending_user_message is WebUI-only). Activity panel state-tracker correctly handles all toggle/destroy/recreate transitions with proper inter-turn reset. Branch is stale and needs a rebase before merge — code regions auto-merge cleanly, only CHANGELOG.md has a trivial textual conflict.
Parked at approval — ready for the release agent's rebase + merge/tag pipeline.
…us review) Pre-release Opus review on v0.50.246 caught a SHOULD-FIX in PR #1338's cancel_stream synthesis: the symmetric substring guard (_pending_user in _last_content OR _last_content in _pending_user) was too loose. Common confirmation replies ("ok", "yes", "go") in the prior turn would match longer follow-up prompts ("ok please continue"), the synthesis would be skipped, and the user's typed text would be lost — exactly the data-loss bug #1298 was supposed to fix. The fix: gate the substring check on a timestamp comparison. Only treat the latest user turn as 'already merged by the streaming thread' if its timestamp is at or after pending_started_at. Earlier turns whose content happens to be a substring of the pending must not short-circuit synthesis. Also drops the symmetric (_last_content in _pending_user) branch — that direction was the false-positive vector. Keeps the equality and prefix match (workspace-prefix tolerance from the streaming thread). Adds tests/test_issue1298_cancel_and_activity.py:: test_cancel_synthesizes_when_prior_turn_content_is_substring_of_pending — regression for the exact 'ok' → 'ok please continue' scenario.
|
Shipped in v0.50.246 via batch release PR #1343 (merge commit Released alongside 4 other contributor fixes — see the v0.50.246 entry in CHANGELOG.md. |
…-panel expand state (nesquena#1298) From PR nesquena#1338. Already independently APPROVED by nesquena before being absorbed into v0.50.246. CHANGELOG entries from this PR were dropped during squash (the v0.50.245 section is already shipped); they will be re-added under [v0.50.246] in the release commit. Co-authored-by: nesquena-hermes <[email protected]>
Combines: - 4 contributor PRs (nesquena#1335 user fenced code, nesquena#1337 mermaid+cache-bust, nesquena#1339 fallback_providers list, nesquena#1341 context_length persistence) - Self-built nesquena#1338 (cancel data-loss + activity panel) — already independently APPROVED by nesquena before absorption - CONTRIBUTORS.md and markdown refresh from nesquena#1340 See CHANGELOG.md for the full list with author credit.
…-panel expand state (nesquena#1298) From PR nesquena#1338. Already independently APPROVED by nesquena before being absorbed into v0.50.246. CHANGELOG entries from this PR were dropped during squash (the v0.50.245 section is already shipped); they will be re-added under [v0.50.246] in the release commit. Co-authored-by: nesquena-hermes <[email protected]>
Combines: - 4 contributor PRs (nesquena#1335 user fenced code, nesquena#1337 mermaid+cache-bust, nesquena#1339 fallback_providers list, nesquena#1341 context_length persistence) - Self-built nesquena#1338 (cancel data-loss + activity panel) — already independently APPROVED by nesquena before absorption - CONTRIBUTORS.md and markdown refresh from nesquena#1340 See CHANGELOG.md for the full list with author credit.
…us review) Pre-release Opus review on v0.50.246 caught a SHOULD-FIX in PR nesquena#1338's cancel_stream synthesis: the symmetric substring guard (_pending_user in _last_content OR _last_content in _pending_user) was too loose. Common confirmation replies ("ok", "yes", "go") in the prior turn would match longer follow-up prompts ("ok please continue"), the synthesis would be skipped, and the user's typed text would be lost — exactly the data-loss bug nesquena#1298 was supposed to fix. The fix: gate the substring check on a timestamp comparison. Only treat the latest user turn as 'already merged by the streaming thread' if its timestamp is at or after pending_started_at. Earlier turns whose content happens to be a substring of the pending must not short-circuit synthesis. Also drops the symmetric (_last_content in _pending_user) branch — that direction was the false-positive vector. Keeps the equality and prefix match (workspace-prefix tolerance from the streaming thread). Adds tests/test_issue1298_cancel_and_activity.py:: test_cancel_synthesizes_when_prior_turn_content_is_substring_of_pending — regression for the exact 'ok' → 'ok please continue' scenario.
Closes #1298 — both issues, two distinct root causes
@YanTianlong-01 reported two separate UI/data bugs together against v0.50.240. Both are now fixed in one PR with full test coverage.
Issue 2: Stop/Cancel deletes the user's typed message (data loss)
This is a server-side data-loss bug, not just a client display bug. When a user clicked Stop while the agent was streaming,
cancel_stream()cleareds.pending_user_messagebefore the streaming thread had a chance to merge the user turn intos.messages. The session was saved with neither field set, so the user's typed text was permanently lost from the session JSON. Page refresh did not recover it.Root cause path — verified against master in v0.50.244:
api/routes.py:3255—/api/chat/startwritess.pending_user_message = msgand savesagent.run_conversation(). The agent appends the user message to its own internalmessageslist. The streaming thread's_merge_display_messages_after_agent_result()(api/streaming.py:1965) runs only afterrun_conversation()returnscancel_stream()(api/streaming.py:2467+) eagerly popsSTREAMS/AGENT_INSTANCES, callsagent.interrupt(), and races the streaming thread to acquire the per-session lockstreaming.py:2575+clearspending_user_messagewithout persisting it tos.messages. Goodbye, user turn.Fix. Inside the existing post-cancel session lock, before clearing pending state, synthesize a user turn into
s.messagesfrompending_user_messagewhen the most recent user message ins.messagesisn't already that turn. Content-match guards against double-append when the streaming thread won the race. Pending attachments are preserved on the recovered turn. The synthesis is wrapped in its owntry/exceptso unit-test mock sessions can't escape and skip the rest of the cleanup.Issue 1: Activity panel auto-collapses while user is reading it
Root cause — verified against master:
static/ui.js:2541-2542—ensureActivityGroup()creates the live activity group withtool-call-group-collapsedwhenever it's missingstatic/ui.js:4144—finalizeThinkingCard()force-addstool-call-group-collapsedon every tool boundary, regardless of user intentstatic/ui.js:4205—appendThinking()callsensureActivityGroup(blocks, {live:true, collapsed:true, anchor})on every new thinking eventthinking → tool → thinkingtransition (which destroys and recreates the group viaremoveThinking()at:4236-4239) wipes the expand and snaps the panel shutFix. Adds a per-turn singleton
_liveActivityUserExpandedthat captures the user's last explicit toggle. The summary button's inlineonclicknow invokes_onLiveActivityToggle(g)after the class toggle so user clicks update the tracker.ensureActivityGroup()consults the tracker when re-creating the live group;finalizeThinkingCard()skips the force-collapse when the user has explicitly expanded.clearLiveToolCards()resets the tracker between turns so the next turn starts at the default collapsed state.Tests —
tests/test_issue1298_cancel_and_activity.py— 9 newServer-side (4):
test_cancel_synthesizes_user_message_when_messages_empty— the core data-loss casetest_cancel_does_not_double_append_when_streaming_thread_already_merged— race winnertest_cancel_synthesized_user_message_carries_attachments— attachments preservedtest_cancel_no_pending_user_message_does_nothing_extra— no phantom user turnClient-side (5, source-level guards on ui.js):
test_ui_js_tracks_user_expand_intent_for_live_activity_grouptest_ensure_activity_group_restores_expand_intenttest_finalize_thinking_card_respects_user_expandtest_inline_onclick_records_user_intenttest_clear_live_tool_cards_resets_expand_intentTest results
All adjacent suites still green:
test_cancel_interrupt,test_session_sidecar_repair,test_streaming_race_fix,test_regressions,test_sprint51(which uses Mock sessions — passes thanks to the defensive try/except).Files
api/streaming.py— +49 lines incancel_stream()'s post-cancel cleanupstatic/ui.js— +35 lines:_liveActivityUserExpandedtracker,_onLiveActivityTogglehelper, restore-on-create, respect-on-finalize, reset-on-cleartests/test_issue1298_cancel_and_activity.py— new file, 9 testsCHANGELOG.md— Unreleased entriesCo-authored-by: YanTianlong-01 [email protected]