fix: persist conversation state on interruption to prevent context loss#7785
fix: persist conversation state on interruption to prevent context loss#7785rabi wants to merge 1 commit intoblock:mainfrom
Conversation
On interrupt , messages from the current agent turn are only in-memory and never persisted. The next agent.reply() loads from the session manager, losing all interrupted-turn context. Call replace_conversation() on interruption in both CLI and server to sync the in-memory conversation to the session manager. Change-Id: I4ccbb744642e2dd042a03ac7fe6a32e9cca45967 Signed-off-by: rabi <[email protected]>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3d3526a2b3
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| tracing::info!("Agent task cancelled"); | ||
| if let Err(e) = state | ||
| .session_manager() | ||
| .replace_conversation(&session_id, &all_messages) |
There was a problem hiding this comment.
Persist interruption state from session storage, not stream cache
This replace_conversation writes back all_messages, but that cache can be stale relative to the DB because the agent mutates existing message metadata in place via SessionManager::update_message_metadata (see crates/goose/src/agents/agent.rs around the tool-pair summarization path) without emitting AgentEvent::HistoryReplaced; this handler only updates all_messages on Message/HistoryReplaced. If a cancellation/disconnect happens after those metadata-only updates, this call can overwrite the persisted conversation with older metadata and effectively undo invisibility flags for prior tool messages.
Useful? React with 👍 / 👎.
| .agent | ||
| .config | ||
| .session_manager | ||
| .replace_conversation(&self.session_id, &self.messages) |
There was a problem hiding this comment.
Avoid clobbering metadata when persisting interrupted CLI turns
Persisting self.messages here has the same stale-snapshot problem as the server path: local conversation state is driven by streamed events and does not observe metadata-only DB updates (e.g., agent-side update_message_metadata during tool-pair summarization in crates/goose/src/agents/agent.rs). On an interrupt, this can replace the stored conversation with an older metadata view and re-expose messages that were marked invisible.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Hmm. we probably don't want make it complicated and the scenario of metadata being overwritten as an edge case is better than missing the full context for the interrupted turn.
|
this feels dangerous to me. the reason we do it the way that we do now is that we never want to be in an inconsistent state with the conversation that we have in the db. if we wanted to go this way, we should just add the messages more often to the session manager rather than waiting for the loop to finish. we used to just add the messages as they came in which would corrupt the conversation. replacing the conversation in general doesn't seem like a great idea for what codex says and other reasons. thoughts? |
There was a problem hiding this comment.
I opened #7827 which properly introduces the gap and hopefully captures @DOsinga and copilot concerns with this PR. I've mentioned some of the problems in "alternatives considered"
This PR isn't valid for this sort of problem, and here are the largest issues:
- Wrong tool:
replace_conversationDELETEs all messages then re-inserts - Overfires: triggers on any error or disconnect, not just cancel
- Wrong layer: patches CLI/server instead of agent.rs batching (so need another fix for ACP, etc)
- No tests: no way to verify this worked
@DOsinga, I did not go the path of adding every message to the db (was also suggested by my agent like may be for others), as anyway for retries we've to reset and replace_conversation. Also, if cancellation hits between persisting a tool request and its response, the DB has an orphaned request that fix_conversation silently strips on the next turn. I think the metadata overwrite concern (agent_invisible flags from tool-pair summarization) requires a narrow timing window and only results in the model seeing slightly more context — no data loss. As far as not adding any tests to the PR, I did not see any easy way of doing it. I see that you've started #7850 to address this issue. So I'll drop this and move the discussion there. |
Summary
On interrupt , messages from the current agent turn are only in-memory and never persisted. The next agent.reply() loads from the session manager, losing all interrupted-turn context.
Call replace_conversation() on interruption in both CLI and server to sync the in-memory conversation to the session manager.
Type of Change
Testing
Tested locally