Skip to content

fix(#1298): preserve user message on cancel + persist activity-panel expand state#1338

Closed
nesquena-hermes wants to merge 1 commit intomasterfrom
fix/1298-cancel-message-loss-and-activity-collapse
Closed

fix(#1298): preserve user message on cancel + persist activity-panel expand state#1338
nesquena-hermes wants to merge 1 commit intomasterfrom
fix/1298-cancel-message-loss-and-activity-collapse

Conversation

@nesquena-hermes
Copy link
Copy Markdown
Collaborator

Closes #1298 — both issues, two distinct root causes

@YanTianlong-01 reported two separate UI/data bugs together against v0.50.240. Both are now fixed in one PR with full test coverage.


Issue 2: Stop/Cancel deletes the user's typed message (data loss)

This is a server-side data-loss bug, not just a client display bug. When a user clicked Stop while the agent was streaming, cancel_stream() cleared s.pending_user_message before the streaming thread had a chance to merge the user turn into s.messages. The session was saved with neither field set, so the user's typed text was permanently lost from the session JSON. Page refresh did not recover it.

Root cause path — verified against master in v0.50.244:

  • api/routes.py:3255/api/chat/start writes s.pending_user_message = msg and saves
  • The streaming thread calls agent.run_conversation(). The agent appends the user message to its own internal messages list. The streaming thread's _merge_display_messages_after_agent_result() (api/streaming.py:1965) runs only after run_conversation() returns
  • cancel_stream() (api/streaming.py:2467+) eagerly pops STREAMS/AGENT_INSTANCES, calls agent.interrupt(), and races the streaming thread to acquire the per-session lock
  • If cancel wins the race (typical — the agent thread is blocked in a model API call), the cleanup at streaming.py:2575+ clears pending_user_message without persisting it to s.messages. Goodbye, user turn.

Fix. Inside the existing post-cancel session lock, before clearing pending state, synthesize a user turn into s.messages from pending_user_message when the most recent user message in s.messages isn't already that turn. Content-match guards against double-append when the streaming thread won the race. Pending attachments are preserved on the recovered turn. The synthesis is wrapped in its own try/except so unit-test mock sessions can't escape and skip the rest of the cleanup.


Issue 1: Activity panel auto-collapses while user is reading it

Root cause — verified against master:

  • static/ui.js:2541-2542ensureActivityGroup() creates the live activity group with tool-call-group-collapsed whenever it's missing
  • static/ui.js:4144finalizeThinkingCard() force-adds tool-call-group-collapsed on every tool boundary, regardless of user intent
  • static/ui.js:4205appendThinking() calls ensureActivityGroup(blocks, {live:true, collapsed:true, anchor}) on every new thinking event
  • The user's manual expand state lives only on the DOM class list, so every thinking → tool → thinking transition (which destroys and recreates the group via removeThinking() at :4236-4239) wipes the expand and snaps the panel shut

Fix. Adds a per-turn singleton _liveActivityUserExpanded that captures the user's last explicit toggle. The summary button's inline onclick now invokes _onLiveActivityToggle(g) after the class toggle so user clicks update the tracker. ensureActivityGroup() consults the tracker when re-creating the live group; finalizeThinkingCard() skips the force-collapse when the user has explicitly expanded. clearLiveToolCards() resets the tracker between turns so the next turn starts at the default collapsed state.


Tests — tests/test_issue1298_cancel_and_activity.py — 9 new

Server-side (4):

  • test_cancel_synthesizes_user_message_when_messages_empty — the core data-loss case
  • test_cancel_does_not_double_append_when_streaming_thread_already_merged — race winner
  • test_cancel_synthesized_user_message_carries_attachments — attachments preserved
  • test_cancel_no_pending_user_message_does_nothing_extra — no phantom user turn

Client-side (5, source-level guards on ui.js):

  • test_ui_js_tracks_user_expand_intent_for_live_activity_group
  • test_ensure_activity_group_restores_expand_intent
  • test_finalize_thinking_card_respects_user_expand
  • test_inline_onclick_records_user_intent
  • test_clear_live_tool_cards_resets_expand_intent

Test results

3299 passed, 2 skipped, 3 xpassed, 8 subtests passed in 80.62s

All adjacent suites still green: test_cancel_interrupt, test_session_sidecar_repair, test_streaming_race_fix, test_regressions, test_sprint51 (which uses Mock sessions — passes thanks to the defensive try/except).

Files

  • api/streaming.py — +49 lines in cancel_stream()'s post-cancel cleanup
  • static/ui.js — +35 lines: _liveActivityUserExpanded tracker, _onLiveActivityToggle helper, restore-on-create, respect-on-finalize, reset-on-clear
  • tests/test_issue1298_cancel_and_activity.py — new file, 9 tests
  • CHANGELOG.md — Unreleased entries

Co-authored-by: YanTianlong-01 [email protected]

…expand

Two distinct bugs reported by @YanTianlong-01 in #1298 against v0.50.240.

## Issue 2: Stop/Cancel deletes the user's typed message (data loss)

When a user clicked Stop while the agent was streaming, `cancel_stream()`
in `api/streaming.py` cleared `s.pending_user_message` before the streaming
thread had merged the user turn into `s.messages`. The session was saved
with neither the pending field nor a corresponding message — the user's
typed text was permanently lost from the session JSON, not just the
in-memory client copy. Reload didn't recover it.

The fix runs inside the existing post-cancel session lock and synthesizes
a user turn into `s.messages` from `pending_user_message` when the latest
user message isn't already that turn. Content-match guards against
double-append when the streaming thread won the race and merged before
cancel got the lock. Pending attachments are preserved.

## Issue 1: Activity panel auto-collapses while user is reading it

`ensureActivityGroup()` (`static/ui.js`) creates the live activity group
with `tool-call-group-collapsed` whenever it's missing, and
`finalizeThinkingCard()` force-adds that class on every tool boundary.
The user's manual expand state lives only on the DOM class list, so
every thinking → tool → thinking transition (which destroys and recreates
the group) wipes the expand and snaps the panel shut.

Adds a per-turn singleton `_liveActivityUserExpanded` that tracks the
user's last explicit toggle (set by an inline `_onLiveActivityToggle()`
hook in the summary button's onclick). `ensureActivityGroup()` consults
it when re-creating the live group; `finalizeThinkingCard()` skips the
force-collapse when the user has explicitly expanded.
`clearLiveToolCards()` resets the tracker between turns so the next turn
starts at the default collapsed state.

## Tests

- `tests/test_issue1298_cancel_and_activity.py` — 9 tests:
  - 4 server-side: cancel synthesises user message, doesn't double-append,
    preserves attachments, no-ops when there's no pending
  - 5 client-side: source-level guards on `_liveActivityUserExpanded`,
    `ensureActivityGroup`, `finalizeThinkingCard`, the inline onclick,
    and `clearLiveToolCards`

All adjacent suites still pass (test_cancel_interrupt,
test_session_sidecar_repair, test_streaming_race_fix, test_regressions).

Closes #1298

Co-authored-by: YanTianlong-01 <[email protected]>
Copy link
Copy Markdown
Owner

@nesquena nesquena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — end-to-end ✅ (clean approve, branch needs rebase before merge)

Heads up — branch is stale

The PR branched off b24b033 (pre v0.50.245). Current master is at 52e1567, which includes the v0.50.245 batch (PR #1334). I attempted a rebase locally:

  • api/streaming.py and static/ui.js auto-merge cleanly (the PR's regions don't overlap with #1316/#1322/#1324/#1326's regions).
  • CHANGELOG.md has a textual conflict: the PR adds two entries under ## [Unreleased] while master now has ## [v0.50.245] — 2026-04-30 populated. Resolution is mechanical — keep both sections, with the PR's two entries under [Unreleased] and v0.50.245 unchanged below.

Tests pass identically against the un-rebased branch (3247 + 9 new = full suite count for this branch's lineage; rebased onto v0.50.245 the count would be 3266). Reviewer/release agent should rebase before merge.

What this ships

Closes #1298 — two distinct UI/data bugs reported by @YanTianlong-01 against v0.50.240:

  1. Issue 2 (data loss): Stop/Cancel during streaming wiped the user's typed message from the session JSON.
  2. Issue 1 (UX): The Activity panel auto-collapsed every time a new tool/thinking event arrived, even after the user had manually expanded it.

Files touched (commit d90a605):

  • api/streaming.py — +49 lines inside cancel_stream()'s post-cancel session lock (synthesizes a user turn from pending_user_message before the existing field-clear)
  • static/ui.js — +37 / -5 lines: _liveActivityUserExpanded tracker + helpers, restore-on-create, respect-on-finalize, reset-on-clear
  • tests/test_issue1298_cancel_and_activity.py — 326-line new file, 9 tests
  • CHANGELOG.md — Unreleased entries (will conflict on rebase, see above)

Issue verification

  • The reporter described both symptoms accurately. The triage comment on #1298 (left by nesquena-hermes) initially proposed Issue 2 as a client-side rollback bug in static/messages.js cancel handler. The PR diagnosis is stronger and correct: the data is actually lost from the session JSON server-side, not just from the in-memory client copy. I verified by re-reading cancel_stream() in current master — _cs.pending_user_message = None ran without first persisting the pending text to _cs.messages, then _cs.save() wrote the cleared state to disk. Page reload after cancel would see neither field set. The PR's server-side fix solves the root cause; a client-side fix would only mask the symptom on the same page load and lose the data on refresh.

End-to-end trace — Issue 2 (cancel data loss)

  1. Pending write at /api/chat/startapi/routes.py:3251-3258: under _get_session_agent_lock(session_id), sets s.pending_user_message = msg, s.pending_attachments = attachments, calls s.save(). Pending message is durably on disk.
  2. Streaming thread picks up the session, eventually reaches the success path under _agent_lock (api/streaming.py:1957:2152-2155):
    • Runs s.messages = _merge_display_messages_after_agent_result(...) (which incorporates the agent's result['messages'] — the agent appends user_msg to its internal list at the start of run_conversation, see run_agent.py:10286-10287).
    • Sets s.active_stream_id = None, s.pending_user_message = None, then s.save().
  3. Cancel race: cancel_stream() (api/streaming.py:2471+):
    • Pops STREAMS/CANCEL_FLAGS/AGENT_INSTANCES under STREAMS_LOCK, calls agent.interrupt().
    • Acquires _get_session_agent_lock(_cancel_session_id) — the same Lock the streaming thread uses.
    • PR's new block at api/streaming.py:2576-2627: while holding the lock, before clearing pending fields, looks at _cs.messages[-1]. If the trailing user message doesn't already match pending_user_message (using in substring match in either direction to tolerate the streaming thread's [Workspace: ...] prefix), synthesizes a user turn with timestamp + attachments and appends.
    • Clears pending fields, saves session, releases lock.

Race outcome by timing:

Cancel timing Pre-fix behaviour Post-fix behaviour
Beats streaming thread's merge (typical — agent thread blocked in API) s.messages empty + pending cleared → user message lost from disk Recovery synthesizes turn → user message persists
Streaming thread merged first Streaming thread already wrote user turn into s.messages → fine Cancel sees pending_user_message is None (streaming thread cleared it under same lock) → recovery short-circuits ✓
Streaming thread crashes/exception before merge _last_resort_sync_from_core (api/streaming.py:1313-1338) handles via core-transcript sync OR appends recovered turn + error marker Same as before — independent safety net still applies

The recovery is a defensive layer, primarily load-bearing in the typical case (cancel beats merge). Even when the streaming thread eventually runs its merge after cancel released the lock, the merge incorporates the user turn from the agent's result anyway, so the final state is consistent.

End-to-end trace — Issue 1 (activity panel auto-collapse)

Existing behaviour:

  • ensureActivityGroup() (static/ui.js:2538-2575) creates the group with tool-call-group-collapsed whenever it's missing, on appendThinking() calls.
  • finalizeThinkingCard() (static/ui.js:4177+) force-adds tool-call-group-collapsed on every tool boundary.
  • removeThinking() (static/ui.js:4236-4239) removes the empty live group from the DOM during thinking → tool transitions, so the next thinking event re-enters ensureActivityGroup and re-creates collapsed.

PR's fix:

  1. New module-scope tracker _liveActivityUserExpanded (undefined / true / false) at static/ui.js:2545.
  2. _onLiveActivityToggle(group) reads !group.classList.contains('tool-call-group-collapsed') after class toggle. Only fires for data-live-tool-call-group="1" (skips programmatic toggles on settled groups).
  3. _clearLiveActivityUserIntent() resets to undefined.
  4. ensureActivityGroup() re-create branch consults the tracker:
    if(live && _liveActivityUserExpanded === true) collapsed=false;
    else if(live && _liveActivityUserExpanded === false) collapsed=true;
    Strict === true / === false correctly distinguishes "user expanded", "user collapsed", and "user hasn't touched it" (undefined → fall through to default).
  5. Inline onclick template now invokes _onLiveActivityToggle(g) after the class-toggle line.
  6. finalizeThinkingCard() skips force-collapse when _liveActivityUserExpanded === true.
  7. clearLiveToolCards() resets the tracker between turns.

Cross-tool consistency

Issue 2 affects only WebUI session state — pending_user_message is a WebUI-only field that the CLI doesn't read or write. Verified: searching /tmp/hermes-agent-fresh/ for pending_user_message returns no hits. So this fix has no cross-tool surface.

Race / lock analysis

  • cancel_stream() already acquires _get_session_agent_lock(_cancel_session_id) at api/streaming.py:2580 before the existing field-clear. The PR's new recovery code lives inside that same lock acquisition, so it inherits the existing lock ordering (STREAMS_LOCK → release → session lock).
  • The streaming thread takes locks in the same order at api/streaming.py:1957. No deadlock risk.
  • Recovery and field-clear happen atomically from the streaming thread's perspective. Either the streaming thread's merge runs first (cancel sees pending=None, recovery short-circuits) or cancel runs first (recovery synthesizes, streaming thread's later merge overwrites with its own version that also has the user turn from the agent's result).
  • The recovery block is wrapped in its own try/except so non-list _cs.messages shapes (e.g. unit-test Mock sessions, malformed sidecars) can't propagate exceptions and skip the rest of the cleanup. Errors are debug-logged.

Activity-panel expand state-machine harness

         user clicks summary               _onLiveActivityToggle reads
         ─────────────────────────────►   classList → !contains('collapsed')
         classList.toggle('collapsed')    →  tracker=true (expanded) or false (collapsed)

         removeThinking() destroys group
         ─────────────────────────────►   tracker preserved (singleton, not on DOM)

         appendThinking() re-enters
         ensureActivityGroup()            →  consults tracker:
                                              true  → collapsed=false (restore)
                                              false → collapsed=true (respect collapse)
                                              undef → opts.collapsed default

         finalizeThinkingCard()           →  if tracker===true: SKIP force-collapse
                                              else: force-add 'collapsed'

         next turn: clearLiveToolCards()  →  _clearLiveActivityUserIntent() → undefined

Verified by grep that all five guards exist in static/ui.js. ✅

Edge-case matrix

# Scenario Expected Actual
1 Cancel before agent runs (pre-flight) Recovery synthesizes user turn ✅ test 1
2 Cancel after streaming thread merged No double-append ✅ test 2 + lock atomicity (streaming thread cleared pending under same lock)
3 Cancel with attachments Recovered turn preserves filenames ✅ test 3
4 Cancel with no pending (post-result) No phantom user turn ✅ test 4
5 Mock session in unit tests Cleanup completes, no cascade failure ✅ try/except wrap
6 User expands → tool boundary Stays expanded _liveActivityUserExpanded === true guard
7 User collapses → next event Stays collapsed ✅ tracker === false branch
8 Default first-turn state Starts collapsed ✅ tracker undefined → opts.collapsed
9 Turn finalize → next turn Resets to default ✅ clearLiveToolCards calls _clearLiveActivityUserIntent
10 Cross-tool (CLI reads same session) No change ✅ pending_user_message is WebUI-only

Other audit — things that are correct

  • No XSS surface — recovered user turn synthesizes a dict with the verbatim pending_user_message (the exact text the user typed), no template rendering. Inline onclick template change just adds if(typeof _onLiveActivityToggle==='function')_onLiveActivityToggle(g); — no user input flows through.
  • No new endpoints, file paths, or auth gates touched.
  • No eval/exec/Function() in the diff.
  • No secrets/tokens in the diff.
  • Timestamp on recovered turn uses int(time.time()), consistent with the rest of the codebase's user-message timestamps.
  • Attachments preserved correctly — copies pending_attachments only when it's a list/tuple, defending against malformed sidecars.

Tests

  • tests/test_issue1298_cancel_and_activity.py9/9 pass
  • 4 server-side tests exercise real cancel_stream() + Session save/load round-trips with isolated SESSION_DIR — not just static analysis. Verified the data loss at the disk level.
  • 5 client-side tests are source-level guards on static/ui.js (regex/grep checks). Not behavioural — but the changed surface is small enough that the static guards catch the relevant invariants (tracker exists, === true guard in finalize, inline onclick references the helper, clear function called from clearLiveToolCards).
  • Full suite (this branch's lineage): 3247 passed, 54 skipped, 3 xpassed, 0 failed in 19.53s
    • This branch has 9 new tests but lacks the +19 tests from PR #1334 (already merged into master). Rebased onto current master, expect ~3266 passing.

Minor observations (non-blocking)

  • The branch needs a rebase before merge — see top of review. CHANGELOG.md is the only conflict; code regions auto-merge. Rebase belongs to the release agent or maintainer; I haven't pushed.
  • Recovery is overwritten in the typical case — when cancel beats merge, recovery runs and saves; when streaming thread later runs its merge, it overwrites s.messages with its own list (which also contains the user turn via the agent's result['messages']). So the recovery is mostly load-bearing for the narrow case where the streaming thread can't merge (exception path, agent thread death, server restart). The typical-cancel case ends up with a single user turn either way. No bug — just a clarification of where the safety net actually fires.
  • Multimodal cancel race (very narrow): if a user attaches images and cancel races AFTER the streaming thread's merge ran (rare — streaming thread holds _agent_lock during merge, cancel waits), the merged user turn has list-shape content (multimodal parts). The recovery's content-match check isinstance(_last_content, str) would skip the equality test and return _already_persisted = False. However, by then the streaming thread has cleared pending_user_message under the same lock, so _pending_user is None short-circuits the entire recovery before the content check runs. Verified safe — flagging in case future refactors decouple "merge" from "clear pending" within the lock.
  • Triage comment proposed a different fix shape for Issue 2 (client-side static/messages.js cancel handler). The PR's server-side fix is the better choice — the data was actually missing from disk, not just from S.messages. The triage comment's second proposed shape (server-side _apply_core_sync_or_error_marker extension) overlaps with the PR's recovery; both end up persisting the recovered turn, just at different points in the cancel flow. The PR's choice (inline in cancel_stream) is more localized.
  • 5 client-side tests are static-only, but the changed JS surface is small enough that the regex guards lock the relevant invariants. A behavioral JSDOM harness could be added in a follow-up but isn't load-bearing here.

Recommendation

Approved. End-to-end trace verified the bug claim (cancel race → pending cleared without persistence → user message lost from session JSON), fix is sound under the existing per-session lock, no double-append risk, no cross-tool surface (pending_user_message is WebUI-only). Activity panel state-tracker correctly handles all toggle/destroy/recreate transitions with proper inter-turn reset. Branch is stale and needs a rebase before merge — code regions auto-merge cleanly, only CHANGELOG.md has a trivial textual conflict.

Parked at approval — ready for the release agent's rebase + merge/tag pipeline.

nesquena-hermes added a commit that referenced this pull request Apr 30, 2026
…us review)

Pre-release Opus review on v0.50.246 caught a SHOULD-FIX in PR #1338's
cancel_stream synthesis: the symmetric substring guard
(_pending_user in _last_content OR _last_content in _pending_user) was too
loose. Common confirmation replies ("ok", "yes", "go") in the prior turn
would match longer follow-up prompts ("ok please continue"), the synthesis
would be skipped, and the user's typed text would be lost — exactly the
data-loss bug #1298 was supposed to fix.

The fix: gate the substring check on a timestamp comparison. Only treat
the latest user turn as 'already merged by the streaming thread' if its
timestamp is at or after pending_started_at. Earlier turns whose content
happens to be a substring of the pending must not short-circuit synthesis.

Also drops the symmetric (_last_content in _pending_user) branch — that
direction was the false-positive vector. Keeps the equality and prefix
match (workspace-prefix tolerance from the streaming thread).

Adds tests/test_issue1298_cancel_and_activity.py::
test_cancel_synthesizes_when_prior_turn_content_is_substring_of_pending —
regression for the exact 'ok' → 'ok please continue' scenario.
@nesquena-hermes
Copy link
Copy Markdown
Collaborator Author

Shipped in v0.50.246 via batch release PR #1343 (merge commit dec4b486).

Live now at https://github.com/nesquena/hermes-webui/releases/tag/v0.50.246.

Released alongside 4 other contributor fixes — see the v0.50.246 entry in CHANGELOG.md.

pull Bot pushed a commit to JamesWilliam1977/hermes-webui that referenced this pull request Apr 30, 2026
…-panel expand state (nesquena#1298)

From PR nesquena#1338. Already independently APPROVED by nesquena before being absorbed into v0.50.246.

CHANGELOG entries from this PR were dropped during squash (the v0.50.245 section is already
shipped); they will be re-added under [v0.50.246] in the release commit.

Co-authored-by: nesquena-hermes <[email protected]>
pull Bot pushed a commit to JamesWilliam1977/hermes-webui that referenced this pull request Apr 30, 2026
Combines:
- 4 contributor PRs (nesquena#1335 user fenced code, nesquena#1337 mermaid+cache-bust,
  nesquena#1339 fallback_providers list, nesquena#1341 context_length persistence)
- Self-built nesquena#1338 (cancel data-loss + activity panel) — already
  independently APPROVED by nesquena before absorption
- CONTRIBUTORS.md and markdown refresh from nesquena#1340

See CHANGELOG.md for the full list with author credit.
GeoffBao pushed a commit to GeoffBao/hermes-webui that referenced this pull request May 1, 2026
…-panel expand state (nesquena#1298)

From PR nesquena#1338. Already independently APPROVED by nesquena before being absorbed into v0.50.246.

CHANGELOG entries from this PR were dropped during squash (the v0.50.245 section is already
shipped); they will be re-added under [v0.50.246] in the release commit.

Co-authored-by: nesquena-hermes <[email protected]>
GeoffBao pushed a commit to GeoffBao/hermes-webui that referenced this pull request May 1, 2026
Combines:
- 4 contributor PRs (nesquena#1335 user fenced code, nesquena#1337 mermaid+cache-bust,
  nesquena#1339 fallback_providers list, nesquena#1341 context_length persistence)
- Self-built nesquena#1338 (cancel data-loss + activity panel) — already
  independently APPROVED by nesquena before absorption
- CONTRIBUTORS.md and markdown refresh from nesquena#1340

See CHANGELOG.md for the full list with author credit.
GeoffBao pushed a commit to GeoffBao/hermes-webui that referenced this pull request May 1, 2026
…us review)

Pre-release Opus review on v0.50.246 caught a SHOULD-FIX in PR nesquena#1338's
cancel_stream synthesis: the symmetric substring guard
(_pending_user in _last_content OR _last_content in _pending_user) was too
loose. Common confirmation replies ("ok", "yes", "go") in the prior turn
would match longer follow-up prompts ("ok please continue"), the synthesis
would be skipped, and the user's typed text would be lost — exactly the
data-loss bug nesquena#1298 was supposed to fix.

The fix: gate the substring check on a timestamp comparison. Only treat
the latest user turn as 'already merged by the streaming thread' if its
timestamp is at or after pending_started_at. Earlier turns whose content
happens to be a substring of the pending must not short-circuit synthesis.

Also drops the symmetric (_last_content in _pending_user) branch — that
direction was the false-positive vector. Keeps the equality and prefix
match (workspace-prefix tolerance from the streaming thread).

Adds tests/test_issue1298_cancel_and_activity.py::
test_cancel_synthesizes_when_prior_turn_content_is_substring_of_pending —
regression for the exact 'ok' → 'ok please continue' scenario.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: Activity panel collapses during streaming and latest user message disappears after Stop

2 participants