
fix: persist conversation state on interruption to prevent context loss #7785

Closed

rabi wants to merge 1 commit into block:main from rabi:interruption

Conversation

@rabi (Contributor) commented Mar 10, 2026

Summary

On interrupt, messages from the current agent turn exist only in memory and are never persisted. The next agent.reply() loads from the session manager, losing all interrupted-turn context.

Call replace_conversation() on interruption in both CLI and server to sync the in-memory conversation to the session manager.
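The intended flow can be illustrated with a minimal sketch. The types below are hypothetical stand-ins (goose's real SessionManager is async and DB-backed, and Message is a placeholder); only the shape matters: sync the in-memory conversation on interruption so the next turn reloads the full context.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for goose's SessionManager and Message types,
// used only to illustrate the persistence flow described above.
#[derive(Clone, Debug, PartialEq)]
struct Message(String);

struct SessionManager {
    store: HashMap<String, Vec<Message>>,
}

impl SessionManager {
    fn new() -> Self {
        Self { store: HashMap::new() }
    }

    // Replace the persisted conversation wholesale with the in-memory view.
    fn replace_conversation(&mut self, session_id: &str, messages: &[Message]) {
        self.store.insert(session_id.to_string(), messages.to_vec());
    }

    fn load(&self, session_id: &str) -> Vec<Message> {
        self.store.get(session_id).cloned().unwrap_or_default()
    }
}

fn main() {
    let mut sm = SessionManager::new();

    // Turn 1 completes normally and is persisted.
    let mut in_memory = vec![
        Message("user: hi".into()),
        Message("assistant: hello".into()),
    ];
    sm.replace_conversation("s1", &in_memory);

    // Turn 2 is interrupted mid-stream: these messages exist only in memory.
    in_memory.push(Message("user: do X".into()));
    in_memory.push(Message("assistant: partial...".into()));

    // Without the fix, the next reply() reloads only the two persisted messages.
    assert_eq!(sm.load("s1").len(), 2);

    // With the fix: sync on interruption, so the next turn sees full context.
    sm.replace_conversation("s1", &in_memory);
    assert_eq!(sm.load("s1").len(), 4);
}
```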

Type of Change

  • Feature
  • Bug fix
  • Refactor / Code quality
  • Performance improvement
  • Documentation
  • Tests
  • Security fix
  • Build / Release
  • Other (specify below)

Testing

Tested locally

On interrupt, messages from the current agent turn are only in-memory
and never persisted. The next agent.reply() loads from the session manager,
losing all interrupted-turn context.

Call replace_conversation() on interruption in both CLI and server to
sync the in-memory conversation to the session manager.

Change-Id: I4ccbb744642e2dd042a03ac7fe6a32e9cca45967
Signed-off-by: rabi <[email protected]>

@chatgpt-codex-connector (bot) left a comment
💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3d3526a2b3


    tracing::info!("Agent task cancelled");
    if let Err(e) = state
        .session_manager()
        .replace_conversation(&session_id, &all_messages)
P2: Persist interruption state from session storage, not stream cache

This replace_conversation writes back all_messages, but that cache can be stale relative to the DB because the agent mutates existing message metadata in place via SessionManager::update_message_metadata (see crates/goose/src/agents/agent.rs around the tool-pair summarization path) without emitting AgentEvent::HistoryReplaced; this handler only updates all_messages on Message/HistoryReplaced. If a cancellation/disconnect happens after those metadata-only updates, this call can overwrite the persisted conversation with older metadata and effectively undo invisibility flags for prior tool messages.
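The hazard described here reduces to a last-write-wins race between the streamed cache and metadata-only DB updates. A minimal sketch, using a hypothetical Msg type rather than goose's real message type:

```rust
// Hypothetical sketch of the stale-snapshot hazard: a metadata-only DB
// update that the in-memory cache never observed is silently undone when
// the stale snapshot is written back on interruption.
#[derive(Clone, Debug, PartialEq)]
struct Msg {
    text: String,
    agent_invisible: bool,
}

fn main() {
    // Cache snapshot taken when the message was first streamed.
    let cached = vec![Msg { text: "tool output".into(), agent_invisible: false }];

    // Later, the agent flips the flag directly in the DB, emitting no event,
    // so the cache never learns about it.
    let mut db = cached.clone();
    db[0].agent_invisible = true;

    // On interrupt, replace_conversation writes the stale cache back...
    db = cached.clone();

    // ...and the invisibility flag is lost.
    assert!(!db[0].agent_invisible);
}
```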


    .agent
    .config
    .session_manager
    .replace_conversation(&self.session_id, &self.messages)
P2: Avoid clobbering metadata when persisting interrupted CLI turns

Persisting self.messages here has the same stale-snapshot problem as the server path: local conversation state is driven by streamed events and does not observe metadata-only DB updates (e.g., agent-side update_message_metadata during tool-pair summarization in crates/goose/src/agents/agent.rs). On an interrupt, this can replace the stored conversation with an older metadata view and re-expose messages that were marked invisible.

@rabi (Contributor, Author) replied:

Hmm, we probably don't want to make this complicated; the metadata-overwrite scenario is an edge case, and living with it is better than losing the full context of the interrupted turn.

@DOsinga DOsinga self-assigned this Mar 11, 2026
@DOsinga (Collaborator) commented Mar 11, 2026

This feels dangerous to me. The reason we do it the way we do now is that we never want the in-memory conversation to be in an inconsistent state with the conversation in the DB.

If we wanted to go this way, we should just add messages to the session manager more often rather than waiting for the loop to finish. We used to add messages as they came in, which would corrupt the conversation. Replacing the conversation wholesale doesn't seem like a great idea, both for the reasons Codex raises and others.

thoughts?

@codefromthecrypt (Collaborator) left a comment

I opened #7827, which properly introduces the gap and hopefully captures @DOsinga's and Copilot's concerns with this PR. I've mentioned some of the problems under "alternatives considered".

This PR isn't valid for this sort of problem, and here are the largest issues:

  • Wrong tool: replace_conversation DELETEs all messages then re-inserts
  • Overfires: triggers on any error or disconnect, not just cancel
  • Wrong layer: patches CLI/server instead of agent.rs batching (so another fix would be needed for ACP, etc.)
  • No tests: no way to verify this worked

@DOsinga DOsinga mentioned this pull request Mar 12, 2026
@rabi (Contributor, Author) commented Mar 13, 2026

we should just add the messages more often to the session manager rather than waiting for the loop to finish

@DOsinga, I did not go down the path of adding every message to the DB (which my agent also suggested, as it likely did for others), since for retries we have to reset and call replace_conversation() anyway. Also, if cancellation hits between persisting a tool request and its response, the DB has an orphaned request that fix_conversation silently strips on the next turn.
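The orphaned-request concern can be sketched as follows. fix_conversation here is a hypothetical simplification of the repair pass mentioned above, not goose's actual implementation; it just drops any tool request that has no matching response:

```rust
// Hypothetical message enum standing in for goose's richer message types.
#[derive(Clone, Debug, PartialEq)]
enum Msg {
    Text(String),
    ToolRequest(u32),
    ToolResponse(u32),
}

// Simplified repair pass: keep a tool request only if a matching
// response exists somewhere in the conversation.
fn fix_conversation(msgs: &[Msg]) -> Vec<Msg> {
    msgs.iter()
        .filter(|m| match m {
            Msg::ToolRequest(id) => msgs.contains(&Msg::ToolResponse(*id)),
            _ => true,
        })
        .cloned()
        .collect()
}

fn main() {
    // Cancellation hit after the request was persisted but before its response.
    let persisted = vec![Msg::Text("user: do X".into()), Msg::ToolRequest(7)];
    let repaired = fix_conversation(&persisted);

    // The orphaned request is silently dropped on the next turn.
    assert_eq!(repaired.len(), 1);
    assert_eq!(repaired[0], Msg::Text("user: do X".into()));
}
```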

I think the metadata overwrite concern (agent_invisible flags from tool-pair summarization) requires a narrow timing window and only results in the model seeing slightly more context — no data loss.

As for not adding tests to this PR: I did not see an easy way to do it.

I see that you've started #7850 to address this issue. So I'll drop this and move the discussion there.

@rabi rabi closed this Mar 13, 2026
