fix(output): respect ToolOutput.max_retries parameter #4687

DouweM merged 9 commits into pydantic:main

Conversation
Thanks for your interest in this issue! However, there are already open PRs addressing issue #4678: #4685. To avoid duplicate efforts, this PR has been closed. If you'd like to contribute, you can review the existing PRs or share your thoughts on issue #4678. If you believe the existing PRs are inactive, please comment on the issue and a maintainer can reassess.
assert max_retries_log == [target_retries] * (target_retries + 1)

def test_tool_output_max_retries_respected():
This test should be removed. It imports from the private pydantic_ai._output module and directly inspects toolset.max_retries — both of which violate the testing guidelines around testing through public APIs and not accessing private internals.
The second test (test_tool_output_max_retries_overrides_agent_retries) already covers the same behavior through the public API by verifying that ctx.max_retries reflects the ToolOutput value at runtime, which is the meaningful assertion.
def test_tool_output_max_retries_overrides_agent_retries():
    """ToolOutput.max_retries takes priority over Agent retries. Regression test for #4678."""
    from pydantic_ai.output import ToolOutput
ToolOutput is already imported at line 70 — remove this inline import. Imports should be at the top of the file.
result = agent.run_sync('Hello')
assert result.output == 'Weather in Mexico City'
assert all(m == target_retries for m in max_retries_seen)
Following the pattern of the existing retry tests above (e.g. test_tool_output_function_retries), this test should also track and assert retries_log (the ctx.retry values) alongside max_retries_seen, and should also assert on result.all_messages() using snapshot() to validate the complete execution trace.
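Roughly, following the logging pattern of the existing retry tests (a sketch only; `target_retries` and the agent/model wiring are assumed to match the surrounding test):

```python
# Sketch only — mirrors the logging pattern of the existing retry tests in this file;
# target_retries and the agent/model wiring are assumed rather than copied.
from pydantic_ai import ModelRetry, RunContext

target_retries = 5  # assumed to match the ToolOutput(max_retries=...) value in the test

retries_log: list[int] = []
max_retries_log: list[int] = []


def get_weather(ctx: RunContext[None], city: str) -> str:
    retries_log.append(ctx.retry)            # retry count so far for this call
    max_retries_log.append(ctx.max_retries)  # per-tool limit visible at runtime
    if ctx.retry < target_retries:
        raise ModelRetry(f'Please try again ({ctx.retry})')
    return f'Weather in {city}'


# After result = agent.run_sync('Hello'):
# assert retries_log == list(range(target_retries + 1))
# assert max_retries_log == [target_retries] * (target_retries + 1)
# assert result.all_messages() == snapshot(...)
```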
self._output_toolset = self._output_schema.toolset
if self._output_toolset:
if self._output_toolset and self._output_toolset.max_retries is None:
@DouweM This change makes ToolOutput.max_retries control the per-tool retry limit (via ToolsetTool.max_retries), but the global output retry counter in _agent_graph.py still uses GraphAgentDeps.max_result_retries (always from the agent's _max_result_retries). This means:
- `ToolOutput(Foo, max_retries=7)` with `Agent(retries=1)` → per-tool limit is 7, but the global `increment_retries` check fires at retry 1, so the effective limit is still 1.
- `ToolOutput(Foo, max_retries=2)` with `Agent(retries=10)` → per-tool limit fires at 2, global at 10, so effective limit is 2.
Before this PR both values were always the same (agent set both). Now they can diverge, but only the "lower wins" direction actually works. A user who sets ToolOutput(Foo, max_retries=5) without also bumping the agent's retries/output_retries to at least 5 will still get the agent default of 1.
Is this the intended semantic, or should ToolOutput.max_retries also flow into (or override) GraphAgentDeps.max_result_retries?
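For concreteness, a minimal sketch of the first scenario (`Foo` is a stand-in model; the wiring mirrors the snippets already in this thread):

```python
# Illustrative sketch of the "lower wins" interaction described above; Foo is a stand-in model.
from pydantic import BaseModel
from pydantic_ai import Agent
from pydantic_ai.output import ToolOutput


class Foo(BaseModel):
    answer: str


agent = Agent(
    'test',
    retries=1,  # GraphAgentDeps.max_result_retries stays at 1...
    output_type=ToolOutput(Foo, max_retries=7),  # ...so the per-tool limit of 7 is never reached
)
```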
# Use the per-tool max_retries so ToolOutput(max_retries=N) is respected,
# falling back to the agent-level max_result_retries for unknown tools.
max_retries = validated.tool.max_retries if validated.tool else ctx.deps.max_result_retries
This pattern was applied here and in the ToolRetryError handler below (line 1027), but the two UnexpectedModelBehavior handlers in this same function (lines 982 and 1019) still use ctx.deps.max_result_retries directly. The one at line 1019 has validated in scope and should use the same pattern for consistency. The one at line 982 fires before validated exists, but could look up the tool via tool_manager.get_tool_def(call.tool_name).
This means the fix only works for validation errors and ModelRetry/ToolRetryError, but not for UnexpectedModelBehavior errors on output tools — those will still be capped by the agent-level retries.
That said, as flagged in the other comment, the deeper question is whether this piecemeal approach is the right fix, or whether ToolOutput.max_retries should also flow into GraphAgentDeps.max_result_retries. @DouweM
toolset=self,
tool_def=tool_def,
max_retries=self.max_retries,
max_retries=self.max_retries if self.max_retries is not None else 1,
The hardcoded fallback else 1 here means that even when OutputToolset.max_retries is None (no ToolOutput.max_retries was set), the ToolsetTool gets max_retries=1 before the agent has a chance to set OutputToolset.max_retries to the agent's default. This works because get_tools() is called after the agent sets OutputToolset.max_retries in __init__, but it's fragile — the else 1 is a dead code path since the agent always fills in None before this method runs. Consider removing the fallback and making the type int (not int | None) by the time get_tools() is called, to make the invariant explicit.
description = output.description
strict = output.strict
if output.max_retries is not None:
    max_retries = max(max_retries or 0, output.max_retries)
max(max_retries or 0, output.max_retries) treats max_retries=0 (if a previous ToolOutput set it explicitly) the same as None. This is probably fine since max_retries=0 doesn't make practical sense, but the idiomatic way to handle this for int | None is: max_retries = max(max_retries, output.max_retries) if max_retries is not None else output.max_retries — which is actually what the commenter on the issue suggested.
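As a standalone sketch of the suggested idiom (not the actual `build()` code):

```python
# Standalone sketch; not the actual OutputToolset.build() code.
def fold_max_retries(current: int | None, new: int) -> int:
    """None-aware max: only an explicitly set previous value participates in max()."""
    return max(current, new) if current is not None else new


assert fold_max_retries(None, 3) == 3
assert fold_max_retries(2, 5) == 5
assert fold_max_retries(5, 2) == 5
```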
assert max_retries_log == [target_retries] * (target_retries + 1)

def test_tool_output_max_retries_respected():
This test should be removed or rewritten. It imports from pydantic_ai._output (a private module) and directly inspects OutputToolset.build() internals like toolset.max_retries — both of which violate the testing guideline of testing through public APIs, not private methods or helpers.
The behavioral test below (test_tool_output_max_retries_overrides_agent_retries) already covers the important case: that ToolOutput(max_retries=N) actually results in N retries at runtime. That's what matters. The extraction logic in build() is an implementation detail.
If you want to keep coverage of the "max across multiple ToolOutputs" case, add a behavioral test with output_type=[ToolOutput(Foo, max_retries=2), ToolOutput(Bar, max_retries=5)] and verify that 5 retries are actually allowed at runtime.
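Something along these lines (a rough sketch; the output-function style mirrors the other retry tests in this thread, and the `'test'` model wiring is an assumption):

```python
# Rough sketch only; the output-function style mirrors the retry tests discussed in this PR,
# and the agent/model wiring is assumed rather than copied from the real test file.
from pydantic_ai import Agent, ModelRetry, RunContext
from pydantic_ai.output import ToolOutput


def output_foo(ctx: RunContext[None], value: str) -> str:
    if ctx.retry < 5:
        raise ModelRetry(f'Retry foo {ctx.retry}')
    return f'Foo: {value}'


def output_bar(ctx: RunContext[None], value: str) -> str:
    if ctx.retry < 2:
        raise ModelRetry(f'Retry bar {ctx.retry}')
    return f'Bar: {value}'


def test_tool_output_max_retries_multiple_outputs():
    agent = Agent(
        'test',
        retries=1,  # deliberately lower than both per-tool limits
        output_type=[ToolOutput(output_foo, max_retries=5), ToolOutput(output_bar, max_retries=2)],
    )
    result = agent.run_sync('Hello')
    assert result.output.startswith(('Foo:', 'Bar:'))
```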
description = output.description
strict = output.strict
if output.max_retries is not None:
    max_retries = max(max_retries or 0, output.max_retries)
max(max_retries or 0, output.max_retries) conflates max_retries=0 with None. While max_retries=0 isn't practically useful, the idiomatic int | None pattern is:
`max_retries = max(max_retries, output.max_retries) if max_retries is not None else output.max_retries` (this was also suggested in the issue comment).
toolset=self,
tool_def=tool_def,
max_retries=self.max_retries,
max_retries=self.max_retries if self.max_retries is not None else 1,
The hardcoded else 1 fallback means that if OutputToolset.max_retries is still None when get_tools() is called, each ToolsetTool gets max_retries=1 regardless of what the agent's retries/output_retries was set to.
In practice, the agent sets output_toolset.max_retries = self._max_result_retries before get_tools() is called, so this fallback shouldn't be hit in normal usage. But if it ever is, 1 is the wrong default — it should match whatever the agent would have set.
Consider making the fallback more robust by asserting that max_retries is not None at this point (since the agent is supposed to have filled it in), or at least documenting why 1 is acceptable here.
    ctx.deps.max_result_retries, error=e, model_settings=ctx.deps.model_settings
)
tool = tool_manager.tools.get(call.tool_name) if tool_manager.tools else None
max_retries = tool.max_retries if tool else ctx.deps.max_result_retries
The pattern max_retries = (...).tool.max_retries if (...).tool else ctx.deps.max_result_retries is repeated four times in this function (lines 983, 998, 1019, 1026). Per the repo's code style guidelines, duplicated logic should be consolidated into a helper. Consider extracting a small helper like:
def _get_max_retries(tool: ToolsetTool[DepsT] | None, ctx: GraphRunContext[...]) -> int:
    return tool.max_retries if tool else ctx.deps.max_result_retries

Or at minimum, since two of the four sites have validated.tool and two have a tool looked up from tool_manager.tools, unify the lookup approach — the first handler (line 982) fetches the tool via tool_manager.tools.get(call.tool_name) while the other three use validated.tool. Using validated.tool consistently (where validated is available) would be cleaner.
result = agent.run_sync('Hello')
assert result.output == 'Weather in Mexico City'
assert all(m == target_retries for m in max_retries_seen)
As flagged in an earlier review, this test should also track and assert ctx.retry values (a retries_log list), matching the pattern of the existing test_tool_output_function_retries test right above. This verifies that the retry counter is incrementing correctly on each attempt, not just that max_retries is visible.
Output tools now rely solely on the per-tool _check_max_retries in ToolManager (same as function tools), removing the redundant global increment_retries calls. This eliminates the shared counter issue where failures on one output tool could interfere with another's retry budget. The increment_retries mechanism is preserved for non-tool output retries: empty responses, text output validation, and unknown tools. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
        yield event
    continue
ctx.state.increment_retries(ctx.deps.max_result_retries, error=e)
raise  # pragma: lax no cover
@DouweM This is the key change in this PR that still needs your input: all four increment_retries calls have been removed from the output tool retry paths in process_tool_calls. The retry limiting for output tools is now handled entirely by ToolManager._check_max_retries, which uses per-tool retry tracking via RunContext.retries[tool_name] — the same mechanism that function tools already use.
This is a sound approach and makes output tools consistent with function tools, but it's a significant behavioral change beyond the original scope of the bug (which was just that ToolOutput.max_retries was silently ignored):
- The global `GraphAgentState.retries` counter is no longer incremented for output tool retries — it now only tracks text output retries (via `_build_retry_node`) and unknown tool calls.
- The error message changes from "Exceeded maximum retries (N) for output validation" to "Tool 'final_result' exceeded max retries count of N".
- For the `UnexpectedModelBehavior` cases (lines 1294 and 1323), the old code would call `increment_retries` before re-raising, which could replace the original error with a max-retries error if the counter was already at the limit. Now the original `UnexpectedModelBehavior` always propagates directly.
The earlier auto-review comments on this topic (from the previous code revision) are now outdated, so I'm re-raising the question here on the current code.
result = agent.run_sync('Hello')
assert result.output == 'Weather in Mexico City'
assert retries_log == [0, 1, 2, 3, 4, 5]
assert max_retries_log == [target_retries] * (target_retries + 1)
This test covers the single-output-tool case well — good use of retries_log, max_retries_log, and the all_messages() snapshot.
However, as flagged in earlier reviews, there should also be a test with multiple output tools that have different max_retries values, e.g. output_type=[ToolOutput(A, max_retries=5), ToolOutput(B, max_retries=2)]. This would:
- Verify that each tool's limit is respected independently (the whole point of the `_max_retries_overrides` dict).
- Confirm that a failure on tool A doesn't exhaust tool B's retry budget (since the per-tool tracking in `ToolManager` is per tool name).
- Exercise the code path in `get_tools()` where `_max_retries_overrides.get(tool_def.name, self.max_retries)` returns different values for different tools.
max_retries: int
max_retries: int | None
"""Default max retries for output tools, set by the Agent. Per-tool overrides from `ToolOutput.max_retries` take priority."""
_max_retries_overrides: dict[str, int]
The _max_retries_overrides dict creates a parallel data structure that maps tool names to their max_retries overrides, separate from the existing processors dict that also maps tool names to their processing info. Consider whether this could be simplified by storing the max_retries override on the ObjectOutputProcessor or alongside the ToolDefinition instead, to avoid needing to keep two name-keyed dicts in sync.
That said, this is a minor concern — the current approach works and the sync between the two dicts is straightforward since they're both built in the same loop in build().
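For illustration, co-location could look roughly like this (field names are hypothetical; the processor field only stands in for the real class):

```python
# Hypothetical sketch of co-locating the per-tool override with its processing info;
# the `processor` field stands in for the real ObjectOutputProcessor from pydantic_ai._output.
from dataclasses import dataclass
from typing import Any


@dataclass
class OutputToolEntry:
    processor: Any                   # existing per-tool processing info
    max_retries: int | None = None   # per-tool override from ToolOutput.max_retries


# build() would then populate a single dict[str, OutputToolEntry] instead of two parallel name-keyed dicts.
```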
The UnexpectedModelBehavior cause-unwrapping in increment_retries is no longer reachable since output tool retries no longer go through this path. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Verifies that multiple ToolOutputs with different max_retries values are tracked independently — output_a(max_retries=5) succeeds after 5 retries while output_b(max_retries=2) has a different limit, and Agent(retries=1) doesn't interfere. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
        yield event
    continue
ctx.state.increment_retries(ctx.deps.max_result_retries, error=e)
raise  # pragma: lax no cover
@DouweM This PR removes all four increment_retries calls from the output tool retry paths in process_tool_calls, making output tool retries entirely governed by the per-tool mechanism in ToolManager._check_max_retries (which uses ToolsetTool.max_retries and RunContext.retries).
This is a significant behavioral change beyond just "respecting ToolOutput.max_retries":
- The global retry counter (`GraphAgentState.retries`) is no longer incremented for output tool failures. This means the "Exceeded maximum retries (N) for output validation" error from `increment_retries` will never be raised for output tool retries — the error is now always "Tool 'X' exceeded max retries count of N" from `ToolManager._check_max_retries`. The test changes in `test_model_test.py` confirm this change in error messaging.
- `GraphAgentDeps.max_result_retries` no longer acts as a cap on output tool retries in `process_tool_calls` — only the per-tool `ToolsetTool.max_retries` matters now.
- Text output retries still use the global counter (via `_build_retry_node` at line 883 and `_handle_text_response` at line 1146), creating an inconsistency in how retries are tracked between text output and tool output.
The original issue (#4678) was that ToolOutput.max_retries was silently ignored. A more conservative fix would have been to pass the per-tool max_retries to increment_retries (as was done in earlier iterations of this PR) rather than removing the calls entirely. Could you weigh in on which approach is correct?
if (
    isinstance(error, exceptions.UnexpectedModelBehavior) and error.__cause__ is not None
):  # pragma: no cover
    error = error.__cause__  # pragma: no cover
Adding # pragma: no cover to work around reduced coverage violates the repo's testing guidelines — the preferred approach is to write tests that exercise this branch, or to remove the branch if it's truly unreachable. If removing the increment_retries calls from the output tool paths makes this branch unreachable, that's worth understanding and documenting rather than suppressing with pragmas.
async def get_tools(self, ctx: RunContext[AgentDepsT]) -> dict[str, ToolsetTool[AgentDepsT]]:
    # max_retries should always be set by the Agent before get_tools() is called
    assert self.max_retries is not None, 'OutputToolset.max_retries must be set before get_tools() is called'
This assert will crash at runtime if max_retries hasn't been set by the Agent. Since max_retries starts as None and is only set in Agent.__init__, this creates a fragile coupling where OutputToolset can't be used independently (e.g. in tests or by advanced users building custom graphs). Per the repo guidelines, prefer a default over an assertion for this kind of invariant — for example, falling back to self.max_retries or 1 in the get() call on line 978 would be more resilient.
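One possible shape for that fallback, as a tiny standalone helper (names are paraphrased from this thread, not the real code):

```python
# Sketch of the suggested fallback; names are paraphrased from this thread, not the real code.
def resolve_output_tool_max_retries(
    overrides: dict[str, int], default: int | None, tool_name: str
) -> int:
    # Fall back to 1 only if the Agent never filled in the toolset-level default.
    return overrides.get(tool_name, default if default is not None else 1)


assert resolve_output_tool_max_retries({'final_result': 5}, None, 'final_result') == 5
assert resolve_output_tool_max_retries({}, 3, 'final_result') == 3
assert resolve_output_tool_max_retries({}, None, 'final_result') == 1
```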
assert result.output == 'A: hello'
assert retries_log == [0, 1, 2, 3, 4, 5]
assert max_retries_log == [5] * 6
This test is a good addition for the multi-tool case, but it should include an assert result.all_messages() == snapshot(...) assertion to match the pattern established by test_tool_output_max_retries_overrides_agent_retries above and the existing test_tool_output_function_retries test. The all_messages() snapshot validates the complete execution trace (tool calls, retry prompts, etc.), not just the final output — this is important for catching regressions in the retry flow.
max_retries: int
max_retries: int | None
"""Default max retries for output tools, set by the Agent. Per-tool overrides from `ToolOutput.max_retries` take priority."""
_max_retries_overrides: dict[str, int]
The _max_retries_overrides dict duplicates the key space of processors — both are keyed by tool name and built in the same loop. Consider storing the max_retries override on the ObjectOutputProcessor (or alongside it as a tuple) rather than maintaining a parallel dict. This would keep per-tool data co-located and avoid needing to keep two name-keyed structures in sync.
That said, this is a minor concern compared to the higher-level design question about removing increment_retries from the output tool paths.
        raise ModelRetry(f'Retry A {ctx.retry}')
    return f'A: {value}'


def output_b(ctx: RunContext[None], value: str) -> str:  # pragma: no cover
# pragma: no cover on output_b — if this function is truly never called (the model always picks output_a), this test doesn't actually exercise the per-tool isolation. A stronger test would have the model call output_b at some point and verify it hits its retry limit (2) independently of output_a's limit (5). As written, this test only demonstrates that output_a gets its override; it doesn't verify that output_b gets a different override.
- Remove dead UnexpectedModelBehavior cause-unwrapping branch from increment_retries (no longer reachable after removing output tool increment_retries calls)
- Replace assert with fallback in OutputToolset.get_tools()
- Add multi-output test verifying per-tool max_retries isolation with snapshot assertion
- Fix stale snapshots for exhaustive strategy tests

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
# Validation passed - execute the tool
try:
    result_data = await tool_manager.execute_tool_call(validated)
except exceptions.UnexpectedModelBehavior as e:
except exceptions.UnexpectedModelBehavior:
    if final_result:
        for event in _emit_skipped_output_tool(
            call, 'Output tool not used - output function execution failed.', output_parts, args_valid=True
        ):
            yield event
        continue
    ctx.state.increment_retries(ctx.deps.max_result_retries, error=e)
    raise  # pragma: lax no cover
except ToolRetryError as e:
    # If we already have a valid final result, don't increment retries for invalid output tools
    # This allows the run to succeed if at least one output tool returned a valid result
    if not final_result:
        ctx.state.increment_retries(ctx.deps.max_result_retries, error=e)
    yield _messages.FunctionToolCallEvent(call, args_valid=True)
    output_parts.append(e.tool_retry)
    yield _messages.FunctionToolResultEvent(e.tool_retry)
📝 Info: Behavioral change: output tool retries no longer affect the global retry counter
This PR removes all ctx.state.increment_retries() calls for output tools in process_tool_calls (_agent_graph.py:1289-1333). Previously, output tool failures incremented BOTH the per-tool counter (via ToolManager._check_max_retries) AND the global counter (GraphAgentState.retries). Now only per-tool tracking applies.
This means:
- Old: 3 output tools each failing once → global counter reaches 3, could trigger `max_result_retries`
- New: each tool's failures tracked independently; global counter unaffected
The global counter (ctx.state.retries) is still incremented for empty responses (line 985), text response retries (line 1057), model-level retries (line 876), and unknown tools (line 1359). This is an intentional semantic change confirmed by the updated tests, but worth noting for reviewers that the max_result_retries limit no longer constrains output tool retries — only ToolOutput.max_retries (or the agent-level default) does.
- Fix `self.max_retries or 1` treating 0 as None (use explicit None check)
- Add tests for exhaustive strategy skip paths when output tool exceeds per-tool retry limit while another tool succeeded
- Fix stale snapshots for exhaustive strategy tests

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Avoid if/elif loop that creates an untaken branch — use dict comprehension and next() lookup instead. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Wrap UnexpectedModelBehavior from _check_max_retries with the original "Exceeded maximum retries (N) for output validation" message so output tool retry errors are distinguishable from function tool errors. Add docstring to ToolOutput.max_retries explaining its relationship to Agent retries/output_retries. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Extract check_incomplete_tool_call() from increment_retries and call it at both output tool re-raise points, so truncated model responses (finish_reason='length') still produce the actionable IncompleteToolCall error suggesting users increase max_tokens. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-authored-by: Douwe Maan <[email protected]> Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
…se tests

- OutputToolset.for_run_step: use copy() instead of replace() since @dataclass(init=False) with custom __init__ makes replace() fail (field names don't match init param names)
- process_tool_calls: replace increment_retries() with bare retries += 1 to align with upstream pydantic#4687 which removed global limit checks in favor of per-tool enforcement via ToolManager._check_max_retries
- Update negative test to not use per-tool override that would bypass the output_retries default
- Add edge case tests: concurrent run isolation, retry counter across tool switches, validation vs execution failure retry counting

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-authored-by: Douwe Maan <[email protected]> Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
Summary
`ToolOutput(Foo, max_retries=3)` accepts a `max_retries` parameter, but the value was silently ignored:

- `OutputToolset.build()` never extracted it from `ToolOutput` instances
- the Agent always overwrote `OutputToolset.max_retries` with its own `retries`/`output_retries`
- the global `increment_retries` check used the agent's retry limit instead of the per-tool limit, blocking `ToolOutput(max_retries=N)` from working when `N > Agent(retries)`

Changes
- Extract `max_retries` from `ToolOutput` in `OutputToolset.build()`, taking `max()` across multiple `ToolOutput`s
- Make `OutputToolset.max_retries` an `int | None` (default `None`) so the Agent only applies its default when no `ToolOutput` specified a value
- Pass the per-tool `max_retries` to `increment_retries` for output tool retries, matching how `Tool(retries=N)` works for function tools
- `ToolOutput(max_retries=N)` now genuinely allows N retries even when `Agent(retries)` is lower, treating `Agent(retries)`/`Agent(output_retries)` as a default for output tools and non-tool output.

Fixes #4678
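For reference, a minimal usage sketch of the fixed behavior (`Weather` is a placeholder model; the `'test'` model name is only for illustration):

```python
# Minimal usage sketch of the fixed behavior; Weather is a placeholder model.
from pydantic import BaseModel
from pydantic_ai import Agent
from pydantic_ai.output import ToolOutput


class Weather(BaseModel):
    city: str
    summary: str


agent = Agent(
    'test',
    retries=2,  # agent-level default for tools that don't set their own limit
    output_type=ToolOutput(Weather, max_retries=5),  # the output tool now genuinely gets 5 retries
)
```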
Test plan
- `test_tool_output_max_retries_respected` — tests `OutputToolset.build()` extraction: explicit value, default (None), and multiple ToolOutputs (max wins)
- `test_tool_output_max_retries_overrides_agent_retries` — behavioral test: `ToolOutput(max_retries=5)` with `Agent(retries=2)` allows all 5 retries and `ctx.max_retries` reflects the ToolOutput value

🤖 Generated with Claude Code