
fix: restore per-block cache_control for anthropic_cache_messages#5227

Merged
DouweM merged 6 commits into pydantic:main from Wh1isper:fix/anthropic-cache-messages-per-block
Apr 28, 2026

Conversation

@Wh1isper
Contributor
@Wh1isper Wh1isper commented Apr 28, 2026

Summary

This PR restores anthropic_cache_messages as a per-block Anthropic cache_control setting for the final message content block.

PR #4840 introduced Anthropic automatic caching via the top-level cache_control parameter and changed anthropic_cache_messages to map to that new behavior with a deprecation warning. That made sense for the official Anthropic API, where automatic caching is the recommended simple path for multi-turn conversations, but it changed the existing behavior of an established Pydantic AI setting.

The previous behavior still matters because several Anthropic-compatible providers and proxy layers continue to use the explicit per-block Anthropic cache format. Restoring it keeps the original setting useful for those integrations while preserving anthropic_cache for Anthropic's top-level automatic caching.

Problem

anthropic_cache_messages previously added explicit cache_control metadata to message content blocks. After #4840, enabling it produced a top-level cache_control request parameter instead.

That created a breaking behavior change for users whose provider expects explicit per-block cache control in the Anthropic message body:

{
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "...",
      "cache_control": {"type": "ephemeral", "ttl": "5m"}
    }
  ]
}

Common affected scenarios include:

  • Anthropic-compatible providers that implement explicit prompt caching in the message format.
  • Gateways and proxy layers that translate Vertex AI or Bedrock APIs into Anthropic-compatible request bodies.
  • Routing providers where top-level automatic caching has provider-specific support, while explicit block-level cache controls remain the portable Anthropic-compatible format.

Evidence from provider documentation

The explicit per-block format is still used across current provider and gateway documentation:

Change

This PR makes the two settings distinct:

It also keeps the conflict check between the two settings, since a request should use one message caching strategy at a time.

Tests

Added and updated Anthropic model tests for:

  • anthropic_cache_messages=True and custom TTL values adding per-block cache_control.
  • Preserving an existing explicit CachePoint cache control.
  • Raising UserError when anthropic_cache and anthropic_cache_messages are enabled together.
  • Cache point budgeting when message-level cache control is used.

Command run:

uv run pytest tests/models/test_anthropic.py -k 'cache_messages or anthropic_cache_fallback_on_unsupported_clients or limit_cache_points'

Result:

11 passed

Checklist

  • Any AI generated code has been reviewed line-by-line by the human PR author, who stands by it.
  • No breaking changes in accordance with the version policy.
  • PR title is fit for the release changelog.

@Wh1isper Wh1isper changed the title from "fix: restore per-block Anthropic message cache control" to "Restore per-block Anthropic message cache control" Apr 28, 2026
@github-actions github-actions Bot added the size: M (Medium PR, 101-500 weighted lines) and bug (report that something isn't working, or PR implementing a fix) labels Apr 28, 2026
@Wh1isper
Contributor Author

@DouweM @DenysMoskalenko Sorry for the ping. Since the previous PR introduced a breaking change, I'd really appreciate it if we could prioritize this PR. Thanks!

## Summary

- Restore `anthropic_cache_messages` as a per-block cache control option for final message content.
- Document usage for Anthropic-compatible gateways and providers.
- Update Anthropic cache tests for message cache behavior and cache-point limits.

Tests: `uv run pytest tests/models/test_anthropic.py -k 'cache_messages or anthropic_cache_fallback_on_unsupported_clients or limit_cache_points'`

Assisted-by: YAAI <[email protected]>
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment


Devin Review found 1 potential issue.

View 3 additional findings in Devin Review.


Comment on lines 241 to 251
 anthropic_cache_messages: bool | Literal['5m', '1h']
-"""Deprecated: use `anthropic_cache` instead.
+"""Whether to add `cache_control` to the last message content block.

-Behaves the same as `anthropic_cache`: uses automatic caching where supported,
-falls back to per-block caching on Bedrock and Vertex. Emits a deprecation warning.
+When enabled, this adds per-block `cache_control` to the last content block in the
+final message. This is useful for Anthropic-compatible providers and gateways that
+support explicit per-block caching but don't support Anthropic's top-level automatic
+caching parameter.

 If `True`, uses TTL='5m'. You can also specify '5m' or '1h' directly.
+Cannot be combined with `anthropic_cache`.
 """
Contributor

🚩 Behavioral change for existing anthropic_cache_messages users (V1 backward compatibility)

This PR changes the semantics of anthropic_cache_messages from a deprecated alias of anthropic_cache (top-level automatic caching) to a distinct per-block caching feature. Previously, anthropic_cache_messages=True on a standard Anthropic client would emit a deprecation warning and set the top-level cache_control parameter for server-side automatic caching. Now it adds per-block cache_control only to the last content block of the last message — a meaningfully different caching behavior.

The version policy at docs/version-policy.md:7 states: "Functionality marked as deprecated will not be removed until V2." While the field isn't removed, its behavior is fundamentally changed. Users who had anthropic_cache_messages=True and hadn't yet migrated to anthropic_cache per the deprecation warning will silently get different caching behavior. On Bedrock/Vertex the behavior happens to be the same (both old and new do per-block fallback), but on the standard Anthropic API the difference is significant.

Given we're at v1.87.0 and the current date is April 2026 (the earliest V2 release date), this may be intentional preparation for V2. Worth confirming with maintainers whether this is acceptable in V1 or should wait for a V2 release.



Contributor Author

I think the two should coexist: their semantics are different, so they are not substitutes for each other, even though in some APIs one can stand in for the other.

Collaborator

The V1 backward-compat lens here is what motivated the PR: #4840 itself was the regression, changing anthropic_cache_messages from its original per-block behavior to a deprecation alias of the new top-level anthropic_cache. That broke users on Anthropic-compatible gateways and proxies that depend on the per-block cache_control format (Bedrock, Vertex partner surface, MiniMax, OpenRouter, LiteLLM). Restoring the original per-block semantics fixes that regression — the alias window was short, and anyone who had migrated into the alias saw a deprecation warning steering them to anthropic_cache already.

Agreed with @Wh1isper that the two should coexist with distinct semantics: top-level automatic caching vs. per-block message caching. Keeping the conflict-check between them keeps misuse loud.

@Wh1isper Wh1isper force-pushed the fix/anthropic-cache-messages-per-block branch from 9d786ed to 23dd692 Compare April 28, 2026 13:13
@DouweM
Collaborator

DouweM commented Apr 28, 2026

@Wh1isper Sorry for the breaking change, thanks for the report & fix! I'll take it over the line.

…essages-per-block

# Conflicts:
#	pydantic_ai_slim/pydantic_ai/models/anthropic.py
@DouweM DouweM self-assigned this Apr 28, 2026
- Extract `_apply_message_cache_control` shared between `_map_message` and `_apply_per_block_caching_fallback`
- Frame `anthropic_cache_messages` docs as the gateway-compatible alternative to `anthropic_cache`, with explicit mutual exclusion
Comment on lines +1203 to +1206
if cache_messages := model_settings.get('anthropic_cache_messages'):
    self._apply_message_cache_control(
        anthropic_messages, '5m' if cache_messages is True else cache_messages
    )
Contributor

The anthropic_cache_messages per-block caching is applied here inside _map_message, while the analogous per-block fallback for anthropic_cache (Bedrock/Vertex) is applied via _apply_per_block_caching_fallback as a separate step after _map_message returns in _messages_create / _messages_count_tokens. Both end up calling _apply_message_cache_control, but the split makes the caching pipeline harder to follow.

Consider moving this out of _map_message and into _messages_create / _messages_count_tokens as a separate step alongside _apply_per_block_caching_fallback, so all message-level caching is visible and sequenced together in the orchestration method. (Yes, this means it needs to appear in both callsites, but _apply_per_block_caching_fallback already does, so the pattern exists.)
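For reference, the per-block injection that both call sites share can be sketched as follows. This is a simplified stand-in for `_apply_message_cache_control`, not the actual implementation; it assumes a non-empty Anthropic-format message list, which is the precondition documented at both call sites.

```python
def apply_message_cache_control(messages: list[dict], ttl: str = '5m') -> None:
    """Attach `cache_control` to the last content block of the final message.

    Simplified sketch of the shared helper discussed above. Assumes
    `messages` is a non-empty list of Anthropic-format messages whose
    `content` is either a string or a list of content blocks.
    """
    content = messages[-1]['content']
    if isinstance(content, str):
        # Normalize a plain-string body into a block list first, so the
        # cache marker can be attached to a concrete content block.
        content = [{'type': 'text', 'text': content}]
        messages[-1]['content'] = content
    content[-1]['cache_control'] = {'type': 'ephemeral', 'ttl': ttl}
```

Applying it to the running example produces exactly the per-block request body shown in the PR description.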

Comment on lines 1299 to +1342
@@ -1305,84 +1307,39 @@ async def test_anthropic_cache_messages_deprecated_custom_ttl(allow_model_requests: None):
         m,
         system_prompt='System instructions.',
         model_settings=AnthropicModelSettings(
             anthropic_cache_messages='1h',
         ),
     )

-    with pytest.warns(DeprecationWarning, match='`anthropic_cache_messages` is deprecated'):
-        await agent.run('User message')
+    await agent.run('User message')

     completion_kwargs = get_mock_chat_completion_kwargs(mock_client)[0]
-    assert completion_kwargs['cache_control'] == snapshot({'type': 'ephemeral', 'ttl': '1h'})


 async def test_anthropic_cache_and_cache_messages_conflict(allow_model_requests: None):
     """Test that enabling both anthropic_cache and anthropic_cache_messages raises UserError."""
     c = completion_message(
         [BetaTextBlock(text='Response', type='text')],
         usage=BetaUsage(input_tokens=10, output_tokens=5),
     )
     mock_client = MockAnthropic.create_mock(c)
     m = AnthropicModel('claude-haiku-4-5', provider=AnthropicProvider(anthropic_client=mock_client))
     agent = Agent(
         m,
         system_prompt='System instructions.',
         model_settings=AnthropicModelSettings(
             anthropic_cache=True,
             anthropic_cache_messages=True,
         ),
     )

     with pytest.raises(UserError, match='cannot both be enabled'):
         await agent.run('User message')


-async def test_limit_cache_points_with_deprecated_cache_messages(allow_model_requests: None):
-    """Test that deprecated anthropic_cache_messages maps to anthropic_cache for cache point limiting."""
+async def test_limit_cache_points_with_cache_messages(allow_model_requests: None):
     c = completion_message(
         [BetaTextBlock(text='Response', type='text')],
         usage=BetaUsage(input_tokens=10, output_tokens=5),
     )
     mock_client = MockAnthropic.create_mock(c)
     m = AnthropicModel('claude-haiku-4-5', provider=AnthropicProvider(anthropic_client=mock_client))
     agent = Agent(
         m,
         system_prompt='System instructions.',
         model_settings=AnthropicModelSettings(
             anthropic_cache_messages=True,
         ),
     )
+    await agent.run(
+        [
+            'Context 1',
+            CachePoint(),
+            'Context 2',
+            CachePoint(),
+            'Context 3',
+            CachePoint(),
+            'Question',
+        ]
+    )

-    # anthropic_cache_messages now maps to anthropic_cache (top-level cache_control),
-    # which reduces the explicit cache point budget from 4 to 3.
-    # With 4 CachePoint markers, the oldest should be removed to fit budget of 3.
-    with pytest.warns(DeprecationWarning, match='`anthropic_cache_messages` is deprecated'):
-        await agent.run(
-            [
-                'Context 1',
-                CachePoint(),  # Oldest, should be removed
-                'Context 2',
-                CachePoint(),  # Should be kept
-                'Context 3',
-                CachePoint(),  # Should be kept
-                'Context 4',
-                CachePoint(),  # Should be kept
-                'Question',
-            ]
-        )

     completion_kwargs = get_mock_chat_completion_kwargs(mock_client)[0]
     messages = completion_kwargs['messages']
-    assert completion_kwargs['cache_control'] == {'type': 'ephemeral', 'ttl': '5m'}
+    assert completion_kwargs['cache_control'] is OMIT

-    cache_count = 0
-    for msg in messages:
-        for block in msg['content']:
-            if 'cache_control' in block:
-                cache_count += 1
-
-    # Budget is 3 (reduced from 4 by automatic caching). 4 CachePoint markers means 1 removed.
-    assert cache_count == 3
+    assert messages == snapshot(
+        [
+            {
+                'role': 'user',
+                'content': [
+                    {'text': 'Context 1', 'type': 'text', 'cache_control': {'type': 'ephemeral', 'ttl': '5m'}},
+                    {'text': 'Context 2', 'type': 'text', 'cache_control': {'type': 'ephemeral', 'ttl': '5m'}},
+                    {'text': 'Context 3', 'type': 'text', 'cache_control': {'type': 'ephemeral', 'ttl': '5m'}},
+                    {'text': 'Question', 'type': 'text', 'cache_control': {'type': 'ephemeral', 'ttl': '5m'}},
+                ],
+            }
+        ]
+    )
Contributor

This test has exactly 4 cache points (3 CachePoint + 1 from anthropic_cache_messages), which matches the budget of 4 exactly — so nothing is actually trimmed. The test name says "limit" but it's really only verifying that anthropic_cache_messages cache points are counted in the budget.

To actually test the limiting/trimming behavior, add enough CachePoint markers that the total (explicit + anthropic_cache_messages) exceeds 4, and verify the oldest explicit CachePoint is removed. That's what the old test_limit_cache_points_with_deprecated_cache_messages was testing (5 total, trimmed to 3 because automatic caching reduced the budget).
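The budgeting rule under discussion can be sketched as follows. This is a hypothetical helper, not the project's actual trimming code; it only assumes what the review states: Anthropic allows at most 4 cache breakpoints per request, and when the total would exceed the budget the oldest explicit cache points are dropped.

```python
# Anthropic's per-request limit on cache breakpoints, per the discussion above.
MAX_CACHE_POINTS = 4


def limit_cache_points(point_indices: list[int], reserved: int = 0) -> list[int]:
    """Keep only the newest explicit cache points that fit the remaining budget.

    `point_indices` are the positions of explicit CachePoint markers, oldest
    first; `reserved` counts breakpoints already consumed by other settings
    (e.g. one for `anthropic_cache_messages`). Oldest points are dropped first.
    """
    budget = MAX_CACHE_POINTS - reserved
    if budget <= 0:
        return []
    return point_indices[-budget:]
```

With 4 explicit markers and 1 reserved point, the oldest marker is trimmed and 3 remain, which is the scenario the review asks the test to exercise.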

DouweM added 3 commits April 28, 2026 15:41
Both call sites (`_map_message` and `_apply_per_block_caching_fallback`) pass the
just-built request message list, which is never empty in practice. Document the
precondition in the docstring instead.
- Move `anthropic_cache_messages` per-block injection out of `_map_message` and into the request orchestration sites alongside `_apply_per_block_caching_fallback`, so all message-level caching is visible together
- Extend `test_limit_cache_points_with_cache_messages` to actually exceed the 4-point budget so the trimming behavior is exercised
@DouweM DouweM changed the title from "Restore per-block Anthropic message cache control" to "fix: restore per-block cache_control for anthropic_cache_messages" Apr 28, 2026
@DouweM DouweM merged commit 88566ae into pydantic:main Apr 28, 2026
45 checks passed
@Wh1isper Wh1isper deleted the fix/anthropic-cache-messages-per-block branch April 29, 2026 02:12
Alex-Resch pushed a commit to Alex-Resch/pydantic-ai that referenced this pull request Apr 29, 2026
