fix: resolve tokio::sync::Mutex deadlock in recipe retry path #7832
Merged
DOsinga merged 3 commits into block:main on Mar 13, 2026
In Rust 2021, the scrutinee of an `if let` expression creates a temporary whose scope extends to the entire `if` statement, including all `else` branches. When `self.final_output_tool.lock().await` was used as an `if let` scrutinee, the MutexGuard remained alive in the `else` branches. The retry path in the final `else` branch called `handle_retry_logic` -> `reset_status_for_retry`, which attempted to acquire the same lock, causing an indefinite deadlock since tokio::sync::Mutex is not reentrant.

This manifested as `goose run --recipe` hanging ~80-90% of the time when the model responded with text-only output (no tool calls) and retry logic was triggered.

Changes:
- agent.rs (line ~1595): Extract mutex inspection into an explicit block with a `FinalOutputAction` enum, ensuring the MutexGuard is dropped before any branch that may re-acquire the lock. Replace fragile `unwrap()` with nested pattern matching.
- agent.rs (line ~1137): Apply the same explicit-scope idiom to the similar `if let ... lock().await` pattern at the top of the main loop. Replace `is_some()` + `clone().unwrap()` with `and_then()` + `if let Some(ref output)`.
- retry.rs (line ~107): Replace `if let ... lock().await.as_mut()` with an explicit `let mut guard = ...` binding, making the drop point visible and preventing future regressions if an `else` branch is added.

Validated with 10/10 consecutive successful runs (9-25s each) vs ~2/10 before the fix. The bug is provider-independent but was reproduced with models that frequently respond with text-only output instead of tool calls.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: Wilfried Roset <[email protected]>
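The explicit-block idiom described for agent.rs can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the PR: `std::sync::Mutex` stands in for `tokio::sync::Mutex` so the example runs without an async runtime, `try_lock` is used so a still-held guard shows up as a `false` result rather than a hang, and the `FinalOutputTool` struct, the variants of `FinalOutputAction`, and the `decide` function are hypothetical.

```rust
use std::sync::Mutex;

// Hypothetical stand-in for the tool state guarded by the mutex.
struct FinalOutputTool {
    final_output: Option<String>,
}

// The enum name comes from the PR description; the variants are assumptions.
#[derive(Debug, PartialEq)]
enum FinalOutputAction {
    NoTool,
    Retry,
    Done(String),
}

// Stand-in for the retry path, which needs the same lock again.
// Returns true if the lock could be taken, i.e. no guard was still alive.
fn reset_status_for_retry(slot: &Mutex<Option<FinalOutputTool>>) -> bool {
    match slot.try_lock() {
        Ok(mut guard) => {
            if let Some(tool) = guard.as_mut() {
                tool.final_output = None;
            }
            true
        }
        Err(_) => false,
    }
}

fn decide(slot: &Mutex<Option<FinalOutputTool>>) -> (FinalOutputAction, bool) {
    // Inspect the state inside an explicit block: the guard is a named
    // local, so it is dropped at the closing brace, before any branch runs.
    let action = {
        let guard = slot.lock().unwrap();
        match guard.as_ref() {
            None => FinalOutputAction::NoTool,
            Some(tool) => match &tool.final_output {
                Some(out) => FinalOutputAction::Done(out.clone()),
                None => FinalOutputAction::Retry,
            },
        }
    };

    // No guard is alive here, so the retry path can take the lock again.
    let relocked = match action {
        FinalOutputAction::Retry => reset_status_for_retry(slot),
        _ => true,
    };
    (action, relocked)
}

fn main() {
    let slot = Mutex::new(Some(FinalOutputTool { final_output: None }));
    let (action, relocked) = decide(&slot);
    println!("{:?}, relock succeeded: {}", action, relocked);
}
```

The point of returning an owned enum from the block is that nothing borrowed from the guard escapes it, so the branches that follow cannot accidentally keep the lock held.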
DOsinga (Collaborator) approved these changes on Mar 12, 2026 and left a comment:
I pushed a simplification to your branch; have a look and see if you like it. If you do, or if I don't hear back, I'll go ahead and merge this soon. Good fix!
Contributor
Author
lgtm. thank you.
Closed
Resolve merge conflict in crates/goose/src/agents/agent.rs:
- Keep the deadlock-safe pattern (extract lock state upfront, drop guard before branching into handle_retry_logic)
- Incorporate main's addition of session_manager.replace_conversation() and AgentEvent::HistoryReplaced in the retry success path
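The "extract lock state upfront, drop guard before branching" rule kept in this merge can also be written with a named guard and an explicit `drop`, in the spirit of the retry.rs change. The sketch below is a hedged illustration only: it uses `std::sync::Mutex` in place of `tokio::sync::Mutex` so it runs without a runtime, and `RetryState`, `step`, and the body of `handle_retry_logic` are hypothetical stand-ins rather than the real goose code.

```rust
use std::sync::Mutex;

// Hypothetical retry bookkeeping; not the real goose types.
struct RetryState {
    attempts: u32,
    succeeded: bool,
}

// Stand-in for handle_retry_logic: it takes the lock itself.
// try_lock returns an error instead of blocking if a guard is still alive,
// in which case this function returns None.
fn handle_retry_logic(state: &Mutex<RetryState>) -> Option<u32> {
    let mut guard = state.try_lock().ok()?;
    guard.attempts += 1;
    Some(guard.attempts)
}

fn step(state: &Mutex<RetryState>) -> Option<u32> {
    // Bind the guard to a name so its lifetime is visible in the code...
    let guard = state.lock().unwrap();
    let needs_retry = !guard.succeeded;
    // ...and release it explicitly before branching.
    drop(guard);

    if needs_retry {
        handle_retry_logic(state)
    } else {
        None
    }
}

fn main() {
    let state = Mutex::new(RetryState { attempts: 0, succeeded: false });
    println!("{:?}", step(&state)); // prints Some(1)
}
```

Copying the needed value out (`needs_retry`) and dropping the guard on its own line keeps the release point obvious even if an `else` branch is added later.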
michaelneale added a commit that referenced this pull request on Mar 15, 2026:

* origin/main:
  fix: tool confirmation handling for multiple requests (#7856)
  Remove dead OllamaSetup onboarding flow (#7861)
  fix: resolve tokio::sync::Mutex deadlock in recipe retry path (#7832)
  Upgrade Electron 40.6.0 → 41.0.0 (#7851)
  Only show up to 50 lines of source code (#7578)
  fix: stop writing without error when hitting broken pipe for goose session list (#7858)
  feat(acp): add session/set_mode handler (#7801)
  Keep messages in sync (#7850)
  More acp tools (#7843)
lifeizhou-ap added a commit that referenced this pull request on Mar 16, 2026:

* main: (191 commits)
  fix: add tool_choice and parallel_tool_calls to chatgpt_codex provider (#7867)
  fix: tool confirmation handling for multiple requests (#7856)
  Remove dead OllamaSetup onboarding flow (#7861)
  fix: resolve tokio::sync::Mutex deadlock in recipe retry path (#7832)
  Upgrade Electron 40.6.0 → 41.0.0 (#7851)
  Only show up to 50 lines of source code (#7578)
  fix: stop writing without error when hitting broken pipe for goose session list (#7858)
  feat(acp): add session/set_mode handler (#7801)
  Keep messages in sync (#7850)
  More acp tools (#7843)
  fix: skip upgrade-insecure-requests CSP for external HTTP backends (#7714)
  fix(shell): prevent hang when command backgrounds a child process (#7689)
  Remove include from Cargo.toml in goose-mcp (#7838)
  Exit agent loop when tool call JSON fails to parse (#7840)
  chore: remove redundant husky prepare script (#7829)
  Add github actions workflow for unused deps (#7681)
  fix: prevent SSE connection drops from silently truncating responses (#7831)
  doc: Added notes in contribution guide for pnpm (#7833)
  add prefer-offline to pnpm config to skip unnecessary registry lookups (#7828)
  fix: remove dead read handler from DeveloperClient (#7821)
  ...
jh-block added a commit that referenced this pull request on Mar 16, 2026:

* main: (65 commits)
  feat(otel): propagate session.id to spans and log records (#7490)
  fix(test): add env_lock to is_openai_reasoning_model tests (#7917)
  fix(acp): pass session_id when loading extensions so skills are discovered (#7868)
  updated canonical models (#7920)
  feat(autovisualiser): Migrate the autovisualiser extension to MCP Apps (#7852)
  fix: add tool_choice and parallel_tool_calls to chatgpt_codex provider (#7867)
  fix: tool confirmation handling for multiple requests (#7856)
  Remove dead OllamaSetup onboarding flow (#7861)
  fix: resolve tokio::sync::Mutex deadlock in recipe retry path (#7832)
  Upgrade Electron 40.6.0 → 41.0.0 (#7851)
  Only show up to 50 lines of source code (#7578)
  fix: stop writing without error when hitting broken pipe for goose session list (#7858)
  feat(acp): add session/set_mode handler (#7801)
  Keep messages in sync (#7850)
  More acp tools (#7843)
  fix: skip upgrade-insecure-requests CSP for external HTTP backends (#7714)
  fix(shell): prevent hang when command backgrounds a child process (#7689)
  Remove include from Cargo.toml in goose-mcp (#7838)
  Exit agent loop when tool call JSON fails to parse (#7840)
  chore: remove redundant husky prepare script (#7829)
  ...
wpfleger96 added a commit that referenced this pull request on Mar 16, 2026:

…oken-retry

* origin/main: (21 commits)
  Remove java/.ai-usage-marker directory (#7925)
  test(acp): add terminal delegation fixtures and fix shell singleton (#7923)
  fix: bump pctx_code_mode to 0.3.0 for iterator type checking fix (#7892)
  feat: persist GooseMode per-session via session DB (#7854)
  feat(otel): propagate session.id to spans and log records (#7490)
  fix(test): add env_lock to is_openai_reasoning_model tests (#7917)
  fix(acp): pass session_id when loading extensions so skills are discovered (#7868)
  updated canonical models (#7920)
  feat(autovisualiser): Migrate the autovisualiser extension to MCP Apps (#7852)
  fix: add tool_choice and parallel_tool_calls to chatgpt_codex provider (#7867)
  fix: tool confirmation handling for multiple requests (#7856)
  Remove dead OllamaSetup onboarding flow (#7861)
  fix: resolve tokio::sync::Mutex deadlock in recipe retry path (#7832)
  Upgrade Electron 40.6.0 → 41.0.0 (#7851)
  Only show up to 50 lines of source code (#7578)
  fix: stop writing without error when hitting broken pipe for goose session list (#7858)
  feat(acp): add session/set_mode handler (#7801)
  Keep messages in sync (#7850)
  More acp tools (#7843)
  fix: skip upgrade-insecure-requests CSP for external HTTP backends (#7714)
  ...
wpfleger96 added a commit that referenced this pull request on Mar 16, 2026:

* origin/main: (72 commits)
  No Check do Check (#7942)
  Log 500 errors and also show error for direct download (#7936)
  fix: retry on authentication failure with credential refresh (#7812)
  Remove java/.ai-usage-marker directory (#7925)
  test(acp): add terminal delegation fixtures and fix shell singleton (#7923)
  fix: bump pctx_code_mode to 0.3.0 for iterator type checking fix (#7892)
  feat: persist GooseMode per-session via session DB (#7854)
  feat(otel): propagate session.id to spans and log records (#7490)
  fix(test): add env_lock to is_openai_reasoning_model tests (#7917)
  fix(acp): pass session_id when loading extensions so skills are discovered (#7868)
  updated canonical models (#7920)
  feat(autovisualiser): Migrate the autovisualiser extension to MCP Apps (#7852)
  fix: add tool_choice and parallel_tool_calls to chatgpt_codex provider (#7867)
  fix: tool confirmation handling for multiple requests (#7856)
  Remove dead OllamaSetup onboarding flow (#7861)
  fix: resolve tokio::sync::Mutex deadlock in recipe retry path (#7832)
  Upgrade Electron 40.6.0 → 41.0.0 (#7851)
  Only show up to 50 lines of source code (#7578)
  fix: stop writing without error when hitting broken pipe for goose session list (#7858)
  feat(acp): add session/set_mode handler (#7801)
  ...
Summary

In Rust 2021, the scrutinee of an `if let` expression creates a temporary whose scope extends to the entire `if` statement, including all `else` branches. When `self.final_output_tool.lock().await` was used as an `if let` scrutinee, the MutexGuard remained alive in the `else` branches. The retry path in the final `else` branch called `handle_retry_logic` -> `reset_status_for_retry`, which attempted to acquire the same lock, causing an indefinite deadlock since tokio::sync::Mutex is not reentrant.

This manifested as `goose run --recipe` hanging ~80-90% of the time when the model responded with text-only output (no tool calls) and retry logic was triggered.

Changes:
- agent.rs (line ~1595): Extract mutex inspection into an explicit block with a `FinalOutputAction` enum, ensuring the MutexGuard is dropped before any branch that may re-acquire the lock. Replace fragile `unwrap()` with nested pattern matching.
- agent.rs (line ~1137): Apply the same explicit-scope idiom to the similar `if let ... lock().await` pattern at the top of the main loop. Replace `is_some()` + `clone().unwrap()` with `and_then()` + `if let Some(ref output)`.
- retry.rs (line ~107): Replace `if let ... lock().await.as_mut()` with an explicit `let mut guard = ...` binding, making the drop point visible and preventing future regressions if an `else` branch is added.

Validated with 10/10 consecutive successful runs (9-25s each) vs ~2/10 before the fix. The bug should be provider-independent but was reproduced with models that frequently respond with text-only output instead of tool calls. I've tested it extensively on OVHcloud and manually with `gpt-oss-120b`, `gpt-oss-20b`, `Qwen3-32B` and `Meta-Llama-3_3-70B-Instruct`.

Type of Change
AI Assistance
Testing
The recipe looks like this:
Related Issues
Relates to N/A
Discussion: N/A