fix: restore per-block `cache_control` for `anthropic_cache_messages` #5227
Conversation
@DouweM @DenysMoskalenko Sorry for the ping. Since the previous PR introduced a breaking change, I'd really appreciate it if we could prioritize this PR. Thanks!
## Summary

- Restore `anthropic_cache_messages` as a per-block cache control option for final message content.
- Document usage for Anthropic-compatible gateways and providers.
- Update Anthropic cache tests for message cache behavior and cache-point limits.

Tests: `uv run pytest tests/models/test_anthropic.py -k 'cache_messages or anthropic_cache_fallback_on_unsupported_clients or limit_cache_points'`

Assisted-by: YAAI <[email protected]>
```diff
 anthropic_cache_messages: bool | Literal['5m', '1h']
-"""Deprecated: use `anthropic_cache` instead.
-
-Behaves the same as `anthropic_cache`: uses automatic caching where supported,
-falls back to per-block caching on Bedrock and Vertex. Emits a deprecation warning.
+"""Whether to add `cache_control` to the last message content block.
+
+When enabled, this adds per-block `cache_control` to the last content block in the
+final message. This is useful for Anthropic-compatible providers and gateways that
+support explicit per-block caching but don't support Anthropic's top-level automatic
+caching parameter.
 
 If `True`, uses TTL='5m'. You can also specify '5m' or '1h' directly.
+Cannot be combined with `anthropic_cache`.
 """
```
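As a hedged sketch of the wire-level effect this docstring describes, using plain-dict Anthropic-style messages (`apply_message_cache_control` here is a hypothetical stand-in for the internal helper, not the library's API):

```python
from typing import Any


def apply_message_cache_control(messages: list[dict[str, Any]], ttl: str) -> None:
    # Hypothetical sketch: mark only the last content block of the final message.
    messages[-1]['content'][-1]['cache_control'] = {'type': 'ephemeral', 'ttl': ttl}


messages = [
    {
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'Long shared context'},
            {'type': 'text', 'text': 'Question'},
        ],
    },
]
apply_message_cache_control(messages, '5m')

# Earlier blocks are untouched; only the final block is marked.
assert 'cache_control' not in messages[-1]['content'][0]
assert messages[-1]['content'][-1]['cache_control'] == {'type': 'ephemeral', 'ttl': '5m'}
```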
🚩 Behavioral change for existing anthropic_cache_messages users (V1 backward compatibility)
This PR changes the semantics of anthropic_cache_messages from a deprecated alias of anthropic_cache (top-level automatic caching) to a distinct per-block caching feature. Previously, anthropic_cache_messages=True on a standard Anthropic client would emit a deprecation warning and set the top-level cache_control parameter for server-side automatic caching. Now it adds per-block cache_control only to the last content block of the last message — a meaningfully different caching behavior.
The version policy at docs/version-policy.md:7 states: "Functionality marked as deprecated will not be removed until V2." While the field isn't removed, its behavior is fundamentally changed. Users who had anthropic_cache_messages=True and hadn't yet migrated to anthropic_cache per the deprecation warning will silently get different caching behavior. On Bedrock/Vertex the behavior happens to be the same (both old and new do per-block fallback), but on the standard Anthropic API the difference is significant.
Given we're at v1.87.0 and the current date is April 2026 (the earliest V2 release date), this may be intentional preparation for V2. Worth confirming with maintainers whether this is acceptable in V1 or should wait for a V2 release.
I think the two should coexist. They are not substitutes for each other, although in some APIs one can replace the other, because their semantics are different.
The V1 backward-compat lens here is what motivated the PR: #4840 itself was the regression, changing anthropic_cache_messages from its original per-block behavior to a deprecation alias of the new top-level anthropic_cache. That broke users on Anthropic-compatible gateways and proxies that depend on the per-block cache_control format (Bedrock, Vertex partner surface, MiniMax, OpenRouter, LiteLLM). Restoring the original per-block semantics fixes that regression — the alias window was short, and anyone who had migrated into the alias saw a deprecation warning steering them to anthropic_cache already.
Agreed with @Wh1isper that the two should coexist with distinct semantics: top-level automatic caching vs. per-block message caching. Keeping the conflict-check between them keeps misuse loud.
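A minimal sketch of what keeping that conflict check loud can look like, on a plain settings dict (hypothetical `check_cache_settings` name; the real check lives in the Anthropic model code and raises pydantic-ai's `UserError` rather than `ValueError`):

```python
from typing import Any


def check_cache_settings(settings: dict[str, Any]) -> None:
    # The two strategies are mutually exclusive: top-level automatic caching
    # (anthropic_cache) vs. per-block caching of the final message content
    # block (anthropic_cache_messages).
    if settings.get('anthropic_cache') and settings.get('anthropic_cache_messages'):
        raise ValueError(
            '`anthropic_cache` and `anthropic_cache_messages` cannot both be enabled'
        )


check_cache_settings({'anthropic_cache': True})  # one strategy at a time: fine

raised = False
try:
    check_cache_settings({'anthropic_cache': True, 'anthropic_cache_messages': '1h'})
except ValueError as exc:
    raised = True
    assert 'cannot both be enabled' in str(exc)
assert raised
```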
Force-pushed from 9d786ed to 23dd692
@Wh1isper Sorry for the breaking change, thanks for the report & fix! I'll take it over the line.
…essages-per-block

# Conflicts:
#	pydantic_ai_slim/pydantic_ai/models/anthropic.py
- Extract `_apply_message_cache_control` shared between `_map_message` and `_apply_per_block_caching_fallback`
- Frame `anthropic_cache_messages` docs as the gateway-compatible alternative to `anthropic_cache`, with explicit mutual exclusion
```python
if cache_messages := model_settings.get('anthropic_cache_messages'):
    self._apply_message_cache_control(
        anthropic_messages, '5m' if cache_messages is True else cache_messages
    )
```
The anthropic_cache_messages per-block caching is applied here inside _map_message, while the analogous per-block fallback for anthropic_cache (Bedrock/Vertex) is applied via _apply_per_block_caching_fallback as a separate step after _map_message returns in _messages_create / _messages_count_tokens. Both end up calling _apply_message_cache_control, but the split makes the caching pipeline harder to follow.
Consider moving this out of _map_message and into _messages_create / _messages_count_tokens as a separate step alongside _apply_per_block_caching_fallback, so all message-level caching is visible and sequenced together in the orchestration method. (Yes, this means it needs to appear in both callsites, but _apply_per_block_caching_fallback already does, so the pattern exists.)
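The suggested sequencing can be sketched on plain dicts (hypothetical `prepare_request`, `map_messages`, and `top_level_cache_supported` names; a simplification of the real `_messages_create` flow):

```python
from typing import Any

Message = dict[str, Any]


def map_messages(messages: list[Message]) -> list[Message]:
    # Stand-in for _map_message: here messages are already Anthropic-shaped dicts,
    # copied so caching mutations don't leak back into the caller's history.
    return [dict(m, content=[dict(b) for b in m['content']]) for m in messages]


def apply_message_cache_control(messages: list[Message], ttl: str) -> None:
    # Shared helper: mark the last content block of the final message.
    messages[-1]['content'][-1]['cache_control'] = {'type': 'ephemeral', 'ttl': ttl}


def prepare_request(
    messages: list[Message], settings: dict[str, Any], top_level_cache_supported: bool
) -> list[Message]:
    anthropic_messages = map_messages(messages)
    # All message-level caching sequenced together, after mapping:
    if cache_messages := settings.get('anthropic_cache_messages'):
        apply_message_cache_control(
            anthropic_messages, '5m' if cache_messages is True else cache_messages
        )
    elif settings.get('anthropic_cache') and not top_level_cache_supported:
        # Per-block fallback for clients without top-level automatic caching.
        apply_message_cache_control(anthropic_messages, '5m')
    return anthropic_messages


msgs = [{'role': 'user', 'content': [{'type': 'text', 'text': 'hi'}]}]
out = prepare_request(msgs, {'anthropic_cache_messages': '1h'}, top_level_cache_supported=True)
assert out[-1]['content'][-1]['cache_control'] == {'type': 'ephemeral', 'ttl': '1h'}
```

With both caching branches in one orchestration function, a reader sees the full message-level caching pipeline top to bottom instead of chasing it into `_map_message`.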
```diff
@@ -1305,84 +1307,39 @@ async def test_anthropic_cache_messages_deprecated_custom_ttl(allow_model_reques
         m,
         system_prompt='System instructions.',
         model_settings=AnthropicModelSettings(
             anthropic_cache_messages='1h',
         ),
     )
 
-    with pytest.warns(DeprecationWarning, match='`anthropic_cache_messages` is deprecated'):
-        await agent.run('User message')
+    await agent.run('User message')
 
     completion_kwargs = get_mock_chat_completion_kwargs(mock_client)[0]
-    assert completion_kwargs['cache_control'] == snapshot({'type': 'ephemeral', 'ttl': '1h'})
+    assert completion_kwargs['cache_control'] is OMIT
 
 
 async def test_anthropic_cache_and_cache_messages_conflict(allow_model_requests: None):
     """Test that enabling both anthropic_cache and anthropic_cache_messages raises UserError."""
     c = completion_message(
         [BetaTextBlock(text='Response', type='text')],
         usage=BetaUsage(input_tokens=10, output_tokens=5),
     )
     mock_client = MockAnthropic.create_mock(c)
     m = AnthropicModel('claude-haiku-4-5', provider=AnthropicProvider(anthropic_client=mock_client))
     agent = Agent(
         m,
         system_prompt='System instructions.',
         model_settings=AnthropicModelSettings(
             anthropic_cache=True,
             anthropic_cache_messages=True,
         ),
     )
 
     with pytest.raises(UserError, match='cannot both be enabled'):
         await agent.run('User message')
 
 
-async def test_limit_cache_points_with_deprecated_cache_messages(allow_model_requests: None):
-    """Test that deprecated anthropic_cache_messages maps to anthropic_cache for cache point limiting."""
+async def test_limit_cache_points_with_cache_messages(allow_model_requests: None):
+    """Test that anthropic_cache_messages cache points count toward the cache-point budget."""
     c = completion_message(
         [BetaTextBlock(text='Response', type='text')],
         usage=BetaUsage(input_tokens=10, output_tokens=5),
     )
     mock_client = MockAnthropic.create_mock(c)
     m = AnthropicModel('claude-haiku-4-5', provider=AnthropicProvider(anthropic_client=mock_client))
     agent = Agent(
         m,
         system_prompt='System instructions.',
         model_settings=AnthropicModelSettings(
             anthropic_cache_messages=True,
         ),
     )
 
-    # anthropic_cache_messages now maps to anthropic_cache (top-level cache_control),
-    # which reduces the explicit cache point budget from 4 to 3.
-    # With 4 CachePoint markers, the oldest should be removed to fit budget of 3.
-    with pytest.warns(DeprecationWarning, match='`anthropic_cache_messages` is deprecated'):
-        await agent.run(
-            [
-                'Context 1',
-                CachePoint(),  # Oldest, should be removed
-                'Context 2',
-                CachePoint(),  # Should be kept
-                'Context 3',
-                CachePoint(),  # Should be kept
-                'Context 4',
-                CachePoint(),  # Should be kept
-                'Question',
-            ]
-        )
+    await agent.run(
+        [
+            'Context 1',
+            CachePoint(),
+            'Context 2',
+            CachePoint(),
+            'Context 3',
+            CachePoint(),
+            'Question',
+        ]
+    )
 
     completion_kwargs = get_mock_chat_completion_kwargs(mock_client)[0]
     messages = completion_kwargs['messages']
-    assert completion_kwargs['cache_control'] == {'type': 'ephemeral', 'ttl': '5m'}
+    assert completion_kwargs['cache_control'] is OMIT
 
-    cache_count = 0
-    for msg in messages:
-        for block in msg['content']:
-            if 'cache_control' in block:
-                cache_count += 1
-
-    # Budget is 3 (reduced from 4 by automatic caching). 4 CachePoint markers means 1 removed.
-    assert cache_count == 3
+    assert messages == snapshot(
+        [
+            {
+                'role': 'user',
+                'content': [
+                    {'text': 'Context 1', 'type': 'text', 'cache_control': {'type': 'ephemeral', 'ttl': '5m'}},
+                    {'text': 'Context 2', 'type': 'text', 'cache_control': {'type': 'ephemeral', 'ttl': '5m'}},
+                    {'text': 'Context 3', 'type': 'text', 'cache_control': {'type': 'ephemeral', 'ttl': '5m'}},
+                    {'text': 'Question', 'type': 'text', 'cache_control': {'type': 'ephemeral', 'ttl': '5m'}},
+                ],
+            }
+        ]
+    )
```
This test has exactly 4 cache points (3 CachePoint + 1 from anthropic_cache_messages), which matches the budget of 4 exactly — so nothing is actually trimmed. The test name says "limit" but it's really only verifying that anthropic_cache_messages cache points are counted in the budget.
To actually test the limiting/trimming behavior, add enough CachePoint markers that the total (explicit + anthropic_cache_messages) exceeds 4, and verify the oldest explicit CachePoint is removed. That's what the old test_limit_cache_points_with_deprecated_cache_messages was testing (5 total, trimmed to 3 because automatic caching reduced the budget).
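The budget arithmetic being discussed can be sketched on a flat list of content blocks (hypothetical `trim_cache_points` helper; the real budget handling lives in the Anthropic model code): with a budget of 4 and 5 marked blocks, the oldest mark is dropped.

```python
from typing import Any


def trim_cache_points(blocks: list[dict[str, Any]], budget: int) -> None:
    # Drop cache_control from the oldest blocks so at most `budget` marks remain.
    cached = [b for b in blocks if 'cache_control' in b]
    for block in cached[: max(0, len(cached) - budget)]:
        del block['cache_control']


cc = {'type': 'ephemeral', 'ttl': '5m'}
blocks = [
    {'text': 'Context 1', 'cache_control': dict(cc)},  # oldest mark: trimmed
    {'text': 'Context 2', 'cache_control': dict(cc)},
    {'text': 'Context 3', 'cache_control': dict(cc)},
    {'text': 'Context 4', 'cache_control': dict(cc)},
    {'text': 'Question', 'cache_control': dict(cc)},
]
trim_cache_points(blocks, budget=4)

assert sum('cache_control' in b for b in blocks) == 4
assert 'cache_control' not in blocks[0]  # oldest was removed first
```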
Both call sites (`_map_message` and `_apply_per_block_caching_fallback`) pass the just-built request message list, which is never empty in practice. Document the precondition in the docstring instead.
- Move `anthropic_cache_messages` per-block injection out of `_map_message` and into the request orchestration sites alongside `_apply_per_block_caching_fallback`, so all message-level caching is visible together
- Extend `test_limit_cache_points_with_cache_messages` to actually exceed the 4-point budget so the trimming behavior is exercised
fix: restore per-block cache_control for anthropic_cache_messages (pydantic#5227)

Co-authored-by: Douwe Maan <[email protected]>
Summary

This PR restores `anthropic_cache_messages` as a per-block Anthropic `cache_control` setting for the final message content block.

PR #4840 introduced Anthropic automatic caching via the top-level `cache_control` parameter and changed `anthropic_cache_messages` to map to that new behavior with a deprecation warning. That made sense for the official Anthropic API, where automatic caching is the recommended simple path for multi-turn conversations, but it changed the existing behavior of an established Pydantic AI setting.

The previous behavior still matters because several Anthropic-compatible providers and proxy layers continue to use the explicit per-block Anthropic cache format. Restoring it keeps the original setting useful for those integrations while preserving `anthropic_cache` for Anthropic's top-level automatic caching.

Problem

`anthropic_cache_messages` previously added explicit `cache_control` metadata to message content blocks. After #4840, enabling it produced a top-level `cache_control` request parameter instead.

That created a breaking behavior change for users whose provider expects explicit per-block cache control in the Anthropic message body:
```json
{
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "...",
      "cache_control": {"type": "ephemeral", "ttl": "5m"}
    }
  ]
}
```

Common affected scenarios include Amazon Bedrock, Google Vertex AI's Claude partner surface, and Anthropic-compatible gateways and proxies such as MiniMax, OpenRouter, and LiteLLM.
Evidence from provider documentation

The explicit per-block format is still used across current provider and gateway documentation:

- MiniMax's Anthropic-compatible API documents per-block `cache_control` settings: https://platform.minimax.io/docs/api-reference/anthropic-api-compatible-cache
- Amazon Bedrock documents `InvokeModel` requests using `cache_control` on content blocks inside `messages`, with supported fields including `system`, `messages`, and `tools`: https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-caching.html
- Google Vertex AI's Claude partner documentation describes `cache_control` as part of the matching cache key: https://docs.cloud.google.com/vertex-ai/generative-ai/docs/partner-models/claude/prompt-caching

Change
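To make the two request shapes concrete, here is a hedged sketch as plain dicts (the top-level parameter shape follows this PR's test assertions on `completion_kwargs['cache_control']`, simplified for illustration):

```python
# Per-block caching: cache_control attached to a content block inside messages.
# This is the format Anthropic-compatible gateways and proxies parse.
per_block_request = {
    'model': 'claude-haiku-4-5',
    'messages': [
        {
            'role': 'user',
            'content': [
                {'type': 'text', 'text': '...', 'cache_control': {'type': 'ephemeral', 'ttl': '5m'}},
            ],
        }
    ],
}

# Top-level automatic caching: a single request-level cache_control parameter,
# with no marks inside the message body.
top_level_request = {
    'model': 'claude-haiku-4-5',
    'messages': [{'role': 'user', 'content': [{'type': 'text', 'text': '...'}]}],
    'cache_control': {'type': 'ephemeral', 'ttl': '5m'},
}

# A gateway that only reads per-block marks sees no caching in the second shape.
assert 'cache_control' in per_block_request['messages'][0]['content'][0]
assert 'cache_control' not in top_level_request['messages'][0]['content'][0]
```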
This PR makes the two settings distinct:

- `anthropic_cache` keeps the top-level automatic caching behavior introduced by #4840 (Add Anthropic automatic prompt caching support).
- `anthropic_cache_messages` again adds per-block `cache_control` to the final message content block.

It also keeps the conflict check between the two settings, since a request should use one message caching strategy at a time.
Tests

Added and updated Anthropic model tests for:

- `anthropic_cache_messages=True` and custom TTL values adding per-block `cache_control`.
- Interaction with explicit `CachePoint` cache control and the cache-point limit.
- `UserError` when `anthropic_cache` and `anthropic_cache_messages` are enabled together.

Command run:

```
uv run pytest tests/models/test_anthropic.py -k 'cache_messages or anthropic_cache_fallback_on_unsupported_clients or limit_cache_points'
```

Result:

Checklist