Skip to content

fix: resolve tokio::sync::Mutex deadlock in recipe retry path#7832

Merged
DOsinga merged 3 commits intoblock:mainfrom
wilfriedroset:deadlock
Mar 13, 2026
Merged

fix: resolve tokio::sync::Mutex deadlock in recipe retry path#7832
DOsinga merged 3 commits intoblock:mainfrom
wilfriedroset:deadlock

Conversation

@wilfriedroset
Copy link
Copy Markdown
Contributor

Summary

In Rust 2021, the scrutinee of an if let expression creates a temporary whose scope extends to the entire if statement, including all else branches. When self.final_output_tool.lock().await was used as an if let scrutinee, the MutexGuard remained alive in the else branches. The retry path in the final else branch called handle_retry_logic -> reset_status_for_retry, which attempted to acquire the same lock, causing an indefinite deadlock since tokio::sync::Mutex is not reentrant.

This manifested as goose run --recipe hanging ~80-90% of the time when the model responded with text-only output (no tool calls) and retry logic was triggered.

Changes:

  • agent.rs (line ~1595): Extract mutex inspection into an explicit block with a FinalOutputAction enum, ensuring the MutexGuard is dropped before any branch that may re-acquire the lock. Replace fragile unwrap() with nested pattern matching.

  • agent.rs (line ~1137): Apply the same explicit-scope idiom to the similar if let ... lock().await pattern at the top of the main loop. Replace is_some() + clone().unwrap() with and_then() + if let Some(ref output).

  • retry.rs (line ~107): Replace if let ... lock().await.as_mut() with an explicit let mut guard = ... binding, making the drop point visible and preventing future regressions if an else branch is added.

Validated with 10/10 consecutive successful runs (9-25s each) vs ~2/10 before the fix. The bug should be provider-independent but was reproduced with models that frequently respond with text-only output instead of tool calls. I've tested it extensively on OVHcloud and manually with gpt-oss-120b, gpt-oss-20b, Qwen3-32B and Meta-Llama-3_3-70B-Instruct.

Type of Change

  • Feature
  • Bug fix
  • Refactor / Code quality
  • Performance improvement
  • Documentation
  • Tests
  • Security fix
  • Build / Release
  • Other (specify below)

AI Assistance

  • This PR was created or reviewed with AI assistance

Testing

goose run --recipe sql-migration-review.yaml --params "diff_content=./samples/my-sample.txt"

The recipe look like this:

version: "1.0.0"
title: "SQL Migration PR Review"
description: >-
  Semantic design review of SQL migration files in a PR.
  Complements the sql-migrate-checks suite (SM0001-SM0019) by focusing on
  schema design quality, migration safety, data integrity, performance,
  and backward compatibility.
parameters:
  - key: diff_content
    input_type: file
    requirement: required
    description: "Path to the file containing the SQL diff to review (dialect, file list, and unified diff)"
instructions: |
  You are a senior database reliability engineer reviewing SQL migration files
  in a pull request. Your job is to produce a concise, actionable design review.

  IMPORTANT CONSTRAINTS:
  - Do NOT check SM0001-SM0019 rules (filename ordering, DEFAULT NULL, IF EXISTS,
    TIMETZ, OWNER TO, ENGINE=, meta.json schema, etc.). Those are already enforced
    by the sql-migrate-checks CI suite.
  - Do NOT report syntax or formatting issues.
  - Focus exclusively on semantic design review: things scripts cannot catch.
  - Your report MUST be shorter than the combined diff you reviewed.

  Follow these steps exactly:

  STEP 1 - Parse the diff:
  The diff content is provided below between the markers. It contains a header
  with the database dialect and the list of changed SQL files, followed by the
  full unified diff of all SQL migration changes in this PR.

  <diff>
  {{ diff_content | indent(4) }}
  </diff>

  If the diff section is empty or contains no meaningful SQL changes, write
  AI-REVIEW.md with "No SQL migration files found in this PR." and stop.

  STEP 2 - Review each migration against these focus areas:

  A. Schema design quality
     - Are data types appropriate? (e.g., TEXT with CHECK vs VARCHAR, TIMESTAMPTZ vs
       TIMESTAMP, NUMERIC for money, BIGINT for IDs)
     - Are naming conventions consistent? (snake_case, meaningful names)
     - Is normalization appropriate?

  B. Migration safety
     - Will ALTER TABLE acquire an ACCESS EXCLUSIVE lock on a potentially large table?
     - Should indexes be created CONCURRENTLY?
     - Are there operations that could cause downtime on large tables?
     - For PostgreSQL: adding a column with a volatile DEFAULT rewrites the table
       in versions before PG 11.

  C. Data integrity
     - Does the Down migration risk data loss? (DROP COLUMN, DROP TABLE without
       consideration for data backup)
     - Are there unsafe type casts or column type changes?
     - Are NOT NULL constraints added without DEFAULT on existing populated tables?

  D. Performance implications
     - Are there missing indexes for likely query patterns?
     - Are there columns that would benefit from partial or expression indexes?
     - For large tables: should partitioning be considered?
     - Are composite indexes ordered correctly for query selectivity?

  E. Backward compatibility
     - Will this migration break existing consumers? (column renames, type changes,
       dropped columns)
     - Would an add-then-migrate-then-drop pattern be safer?

  Consult your database-design skill references for dialect-specific best practices
  (postgresql-database-design.md, mysql-database-design.md, query patterns).

  STEP 3 - Classify each finding:
  - BLOCKER: Design flaws that risk data loss, extended downtime, or corruption.
    These must be addressed before merge.
  - QUESTION: Ambiguous patterns where human clarification is needed from the
    development or DBRE team.
  - SUGGESTION: Optional improvements for better design or performance. Do not
    block merge.

  STEP 4 - Write AI-REVIEW.md using EXACTLY this markdown structure.
  Copy the formatting precisely. Do NOT change heading levels or turn bold
  text into headings.

  Line 1 MUST be: # 🔍 SQL Migration Review
  (single # for the title, NOT ##)

  Then a blank line, then these two metadata lines using bold text (NOT headings):
  **📂 Files reviewed**: `file1.sql`, `file2.sql`
  **🗄️ Dialect**: PostgreSQL

  Then a blank line, then: ## 📋 Executive Summary
  Then a blank line, then 2-3 sentences.

  Then append ONLY sections that have findings. Each section uses ## heading,
  a blockquote description, and bulleted findings with severity emoji prefixes:

  ## 🚨 Blockers
  (blank line)
  > Items that **must** be addressed before merge.
  (blank line)
  - 🔴 **file.sql** (line N): description

  ## ❓ Open Questions
  (blank line)
  > Items requiring human clarification from the development or DBRE team.
  (blank line)
  - 🟡 **file.sql** (line N): question

  ## 💡 Suggestions
  (blank line)
  > Optional improvements. Do not block merge.
  (blank line)
  - 🟢 **file.sql** (line N): suggestion

  RULES FOR THE REPORT:
  - Title uses single # and finding sections use ##. Never use ### or deeper.
  - **📂 Files reviewed** and **🗄️ Dialect** are bold text lines, NOT headings.
  - Filenames in the "Files reviewed" line MUST be wrapped in backticks.
  - Executive Summary is ALWAYS present.
  - Blockers, Open Questions, Suggestions appear ONLY if they have items.
  - EVERY bullet point MUST use the correct colored circle emoji prefix:
    🔴 for Blockers, 🟡 for Open Questions, 🟢 for Suggestions.
  - EVERY bullet point in Blockers, Open Questions, and Suggestions MUST start
    with the emoji then **filename** (line N): — never omit the file and line reference.
  - Do NOT include any SM error codes.
  - Do NOT repeat what the diff already says; add insight the developer does not have.
  - Keep findings concise: one sentence for the issue, one for the recommendation.
  - The total report MUST be shorter than the combined diff.

  Write the AI-REVIEW.md file and confirm it was written successfully.
prompt: |
  Review the SQL migration changes provided in your instructions.
  Follow every step in your instructions. Write the AI-REVIEW.md report.
settings:
  max_turns: 5
retry:
  max_retries: 1
  checks:
    - type: shell
      command: "test -s AI-REVIEW.md"

Related Issues

Relates to N/A
Discussion: N/A

wilfriedroset and others added 2 commits March 12, 2026 11:47
In Rust 2021, the scrutinee of an `if let` expression creates a
temporary whose scope extends to the entire `if` statement, including
all `else` branches. When `self.final_output_tool.lock().await` was
used as an `if let` scrutinee, the MutexGuard remained alive in the
`else` branches. The retry path in the final `else` branch called
`handle_retry_logic` -> `reset_status_for_retry`, which attempted to
acquire the same lock, causing an indefinite deadlock since
tokio::sync::Mutex is not reentrant.

This manifested as `goose run --recipe` hanging ~80-90% of the time
when the model responded with text-only output (no tool calls) and
retry logic was triggered.

Changes:

- agent.rs (line ~1595): Extract mutex inspection into an explicit
  block with a `FinalOutputAction` enum, ensuring the MutexGuard is
  dropped before any branch that may re-acquire the lock. Replace
  fragile `unwrap()` with nested pattern matching.

- agent.rs (line ~1137): Apply the same explicit-scope idiom to the
  similar `if let ... lock().await` pattern at the top of the main
  loop. Replace `is_some()` + `clone().unwrap()` with `and_then()` +
  `if let Some(ref output)`.

- retry.rs (line ~107): Replace `if let ... lock().await.as_mut()`
  with an explicit `let mut guard = ...` binding, making the drop
  point visible and preventing future regressions if an `else` branch
  is added.

Validated with 10/10 consecutive successful runs (9-25s each) vs
~2/10 before the fix. The bug is provider-independent but was
reproduced with models that frequently respond with text-only output
instead of tool calls.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: Wilfried Roset <[email protected]>
Copy link
Copy Markdown
Collaborator

@DOsinga DOsinga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a simplification to your branch, have a look if you like it. if you do or no reply, I'll go ahead and merge this soon. good fix!

@wilfriedroset
Copy link
Copy Markdown
Contributor Author

wilfriedroset commented Mar 12, 2026

lgtm. thank you.

@DOsinga DOsinga mentioned this pull request Mar 13, 2026
Resolve merge conflict in crates/goose/src/agents/agent.rs:
- Keep the deadlock-safe pattern (extract lock state upfront, drop guard
  before branching into handle_retry_logic)
- Incorporate main's addition of session_manager.replace_conversation()
  and AgentEvent::HistoryReplaced in the retry success path
@DOsinga DOsinga added this pull request to the merge queue Mar 13, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Mar 13, 2026
@DOsinga DOsinga added this pull request to the merge queue Mar 13, 2026
Merged via the queue into block:main with commit 196ee3c Mar 13, 2026
21 checks passed
michaelneale added a commit that referenced this pull request Mar 15, 2026
* origin/main:
  fix: tool confirmation handling for multiple requests (#7856)
  Remove dead OllamaSetup onboarding flow (#7861)
  fix: resolve tokio::sync::Mutex deadlock in recipe retry path (#7832)
  Upgrade Electron 40.6.0 → 41.0.0 (#7851)
  Only show up to 50 lines of source code (#7578)
  fix: stop writing without error when hitting broken pipe for goose session list (#7858)
  feat(acp): add session/set_mode handler (#7801)
  Keep messages in sync (#7850)
  More acp tools (#7843)
lifeizhou-ap added a commit that referenced this pull request Mar 16, 2026
* main: (191 commits)
  fix: add tool_choice and parallel_tool_calls to chatgpt_codex provider (#7867)
  fix: tool confirmation handling for multiple requests (#7856)
  Remove dead OllamaSetup onboarding flow (#7861)
  fix: resolve tokio::sync::Mutex deadlock in recipe retry path (#7832)
  Upgrade Electron 40.6.0 → 41.0.0 (#7851)
  Only show up to 50 lines of source code (#7578)
  fix: stop writing without error when hitting broken pipe for goose session list (#7858)
  feat(acp): add session/set_mode handler (#7801)
  Keep messages in sync (#7850)
  More acp tools (#7843)
  fix: skip upgrade-insecure-requests CSP for external HTTP backends (#7714)
  fix(shell): prevent hang when command backgrounds a child process (#7689)
  Remove include from Cargo.toml in goose-mcp (#7838)
  Exit agent loop when tool call JSON fails to parse (#7840)
  chore: remove redundant husky prepare script (#7829)
  Add github actions workflow for unused deps (#7681)
  fix: prevent SSE connection drops from silently truncating responses (#7831)
  doc: Added notes in contribution guide for pnpm (#7833)
  add prefer-offline to pnpm config to skip unnecessary registry lookups (#7828)
  fix: remove dead read handler from DeveloperClient (#7821)
  ...
jh-block added a commit that referenced this pull request Mar 16, 2026
* main: (65 commits)
  feat(otel): propagate session.id to spans and log records (#7490)
  fix(test): add env_lock to is_openai_reasoning_model tests (#7917)
  fix(acp): pass session_id when loading extensions so skills are discovered (#7868)
  updated canonical models (#7920)
  feat(autovisualiser): Migrate the autovisualiser extension to MCP Apps  (#7852)
  fix: add tool_choice and parallel_tool_calls to chatgpt_codex provider (#7867)
  fix: tool confirmation handling for multiple requests (#7856)
  Remove dead OllamaSetup onboarding flow (#7861)
  fix: resolve tokio::sync::Mutex deadlock in recipe retry path (#7832)
  Upgrade Electron 40.6.0 → 41.0.0 (#7851)
  Only show up to 50 lines of source code (#7578)
  fix: stop writing without error when hitting broken pipe for goose session list (#7858)
  feat(acp): add session/set_mode handler (#7801)
  Keep messages in sync (#7850)
  More acp tools (#7843)
  fix: skip upgrade-insecure-requests CSP for external HTTP backends (#7714)
  fix(shell): prevent hang when command backgrounds a child process (#7689)
  Remove include from Cargo.toml in goose-mcp (#7838)
  Exit agent loop when tool call JSON fails to parse (#7840)
  chore: remove redundant husky prepare script (#7829)
  ...
wpfleger96 added a commit that referenced this pull request Mar 16, 2026
…oken-retry

* origin/main: (21 commits)
  Remove java/.ai-usage-marker directory (#7925)
  test(acp): add terminal delegation fixtures and fix shell singleton (#7923)
  fix: bump pctx_code_mode to 0.3.0 for iterator type checking fix (#7892)
  feat: persist GooseMode per-session via session DB (#7854)
  feat(otel): propagate session.id to spans and log records (#7490)
  fix(test): add env_lock to is_openai_reasoning_model tests (#7917)
  fix(acp): pass session_id when loading extensions so skills are discovered (#7868)
  updated canonical models (#7920)
  feat(autovisualiser): Migrate the autovisualiser extension to MCP Apps  (#7852)
  fix: add tool_choice and parallel_tool_calls to chatgpt_codex provider (#7867)
  fix: tool confirmation handling for multiple requests (#7856)
  Remove dead OllamaSetup onboarding flow (#7861)
  fix: resolve tokio::sync::Mutex deadlock in recipe retry path (#7832)
  Upgrade Electron 40.6.0 → 41.0.0 (#7851)
  Only show up to 50 lines of source code (#7578)
  fix: stop writing without error when hitting broken pipe for goose session list (#7858)
  feat(acp): add session/set_mode handler (#7801)
  Keep messages in sync (#7850)
  More acp tools (#7843)
  fix: skip upgrade-insecure-requests CSP for external HTTP backends (#7714)
  ...
wpfleger96 added a commit that referenced this pull request Mar 16, 2026
* origin/main: (72 commits)
  No Check do Check (#7942)
  Log 500 errors and also show error for direct download (#7936)
  fix: retry on authentication failure with credential refresh (#7812)
  Remove java/.ai-usage-marker directory (#7925)
  test(acp): add terminal delegation fixtures and fix shell singleton (#7923)
  fix: bump pctx_code_mode to 0.3.0 for iterator type checking fix (#7892)
  feat: persist GooseMode per-session via session DB (#7854)
  feat(otel): propagate session.id to spans and log records (#7490)
  fix(test): add env_lock to is_openai_reasoning_model tests (#7917)
  fix(acp): pass session_id when loading extensions so skills are discovered (#7868)
  updated canonical models (#7920)
  feat(autovisualiser): Migrate the autovisualiser extension to MCP Apps  (#7852)
  fix: add tool_choice and parallel_tool_calls to chatgpt_codex provider (#7867)
  fix: tool confirmation handling for multiple requests (#7856)
  Remove dead OllamaSetup onboarding flow (#7861)
  fix: resolve tokio::sync::Mutex deadlock in recipe retry path (#7832)
  Upgrade Electron 40.6.0 → 41.0.0 (#7851)
  Only show up to 50 lines of source code (#7578)
  fix: stop writing without error when hitting broken pipe for goose session list (#7858)
  feat(acp): add session/set_mode handler (#7801)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants