feat: cross-provider service_tier model setting; Anthropic + Gemini API + Vertex Priority PayGo support #4926
Conversation
```python
'pt_only',
'pt_then_flex',
'on_demand',
'flex_only',
```
@markmcd Is there any way we could make (at least some of) the same values work for both GLA and Vertex?
cc @ewjoachim
I'll wait for the Vertex opinion on this one. The GLA values align with what other major providers do, so I'd prefer to remap the Vertex values back (if that's even feasible?)
These are the relevant docs:
- https://docs.cloud.google.com/vertex-ai/generative-ai/docs/provisioned-throughput/use-provisioned-throughput
- https://docs.cloud.google.com/vertex-ai/generative-ai/docs/flex-paygo
They control the following headers:

| Tier Name | `X-Vertex-AI-LLM-Request-Type` | `X-Vertex-AI-LLM-Shared-Request-Type` | Description / Behavior |
|---|---|---|---|
| `pt_then_on_demand` | (Not sent) | (Not sent) | Default behavior: uses PT first. Excess traffic spills over to standard PayGo. |
| `pt_only` | `dedicated` | (Not sent) | Provisioned only: uses PT exclusively. Rejects traffic with a 429 error if capacity is exceeded. |
| `pt_then_flex` | (Not sent) | `flex` | Hybrid Flex: uses PT first. Excess traffic spills over to the lower-cost Flex PayGo tier. |
| `on_demand` | `shared` | (Not sent) | Pure on-demand: bypasses PT to use standard shared resources at regular rates. |
| `flex_only` | `shared` | `flex` | Pure Flex: bypasses PT to use the discounted Flex PayGo tier directly. |
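The routing matrix above can be sketched as a plain lookup from tier name to the headers it sends. This is a hedged illustration of the table, not code from the PR; the constant name is made up:

```python
# Headers sent for each Vertex service tier, per the table above.
# An empty dict means "send no routing headers" (the default behavior).
# Illustrative only; the PR's actual helper may be shaped differently.
VERTEX_TIER_HEADERS: dict[str, dict[str, str]] = {
    'pt_then_on_demand': {},
    'pt_only': {'X-Vertex-AI-LLM-Request-Type': 'dedicated'},
    'pt_then_flex': {'X-Vertex-AI-LLM-Shared-Request-Type': 'flex'},
    'on_demand': {'X-Vertex-AI-LLM-Request-Type': 'shared'},
    'flex_only': {
        'X-Vertex-AI-LLM-Request-Type': 'shared',
        'X-Vertex-AI-LLM-Shared-Request-Type': 'flex',
    },
}
```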
We could definitely use `standard` and `flex`, though it's ambiguous whether they should map to the equivalents with or without PT. That said, by default PT is used unless we send `X-Vertex-AI-LLM-Request-Type: shared`, so it could make sense to:

- Replace `pt_then_on_demand` -> `standard` and `pt_then_flex` -> `flex`, but then we would need to rename at least `on_demand` to something like `on_demand_only` or `pay_as_you_go_only` (to match `flex_only`)
- Or we could treat `standard` and `flex` as aliases for `pt_then_on_demand` and `pt_then_flex` but keep the non-ambiguous names around. It would mean a bit of extra documentation work to convey this in a very unambiguous manner, but it lets folks write unambiguous code.
`priority` doesn't map to anything on Vertex and `pt` doesn't map to anything on Google (as far as I can tell), so it will be hard to get anything perfect. Not sure exactly if that's 100% helpful, but if you think I can help further, feel free :)
My vote is for unambiguous options, and then provide the additional aliases.
The GLA impl aligns with, e.g. OpenAI, and while that doesn't happen transparently in this case, it's a nice portability feature. I think providing similar ergonomics for the Vertex values would be positive, as long as that's a reasonable mental model for Vertex customers (hopefully you can make that call @ewjoachim! I don't know that stack well)
If we agree on this, I'll update the PR to make it clear that `standard` and `flex` are accepted for Vertex as a shim only.
- Since we support service tiers now for OpenAI, Bedrock, GLA, and Vertex, it'd be nice to add a new top-level `service_tier` `ModelSetting` with a narrow set of values (most likely the OpenAI ones) that we then try to map to providers (i.e. interpret in the model classes) as best we can, with clear documentation (in the docstring) of how each provider interprets them.
- If we have a narrower set there, we could then rename the `google_service_tier` field to `google_vertex_service_tier` (and deprecate the original). Then we may either not need a separate `google_gla_service_tier` (if the top-level `service_tier` covers all the values), or we can add a new `google_gla_service_tier` in case granular control is needed.

That way we get the convenience of a single set of values across providers, with the ability to override per-provider values as needed.
I don't have an opinion yet on what the exact top-level service tier values, or the mapping to Vertex, should be, but if the above approach makes sense to you I trust either/both of you to come up with something reasonable 😄
OK I've updated to take this into account using OpenAI's values as the "default". Provider-specific values take precedence, and supported providers have been updated, including adding mappings from generic to specific where it makes sense to.
Another PR has appeared that also addresses this, #5158, I haven't looked at it, but I'm not precious about keeping mine if that's better.
@markmcd Thanks Mark, I've been working with an agent on consolidating these related PRs so I'll tell it to look at your new changes.
The agent did have a question for you (to pass on to the Vertex team). In its own words:
@Mawox has been picking up the design direction in #5158 (top-level `service_tier` with per-provider fallbacks as we discussed), and @anatolec added Priority PayGo values in #5094. I'm folding both into one PR and want to extend the cross-provider mapping to Vertex, rather than keeping the "Vertex ignored" carve-out — with `pt_then_priority` now in scope, the mapping looks clean:
- flex → X-Vertex-AI-LLM-Shared-Request-Type: flex
- priority → X-Vertex-AI-LLM-Shared-Request-Type: priority
- default / auto → no headers (PT-then-on-demand default)
Before I commit to that, there's one thing the public docs don't quite spell out: if a project has zero PT quota on the target model/region and we send only the single shared-request-type header, does the request fall through safely to Flex/Priority PayGo, or does it 429? @ewjoachim's original writeup describes it as "Uses PT first. Excess traffic spills over to Flex PayGo," and @anatolec saw `traffic_type: ON_DEMAND_PRIORITY` in a live test — so empirically it looks safe. But before defaulting every cross-provider `service_tier='priority'` user through this on Vertex, could you check with the DeepMind / Vertex team? Specifically:

1. Zero-PT project + only `X-Vertex-AI-LLM-Shared-Request-Type: flex` → 429, or Flex PayGo?
2. Same with `priority` → 429, or Priority PayGo?
3. For a project with PT quota, is the spillover destination when these single-header requests exceed PT actually Flex/Priority (not standard on-demand)?

If 1+2 fall through safely we'll go single-header (respects PT for customers who have it, safe for everyone else). If not, we'll also send `X-Vertex-AI-LLM-Request-Type: shared` to guarantee no PT dependency at the cost of bypassing PT entirely. I'd rather the former if it's actually safe. Thanks!
Thanks @DouweM — happy to fold this into #5158.
Reproduced Devin's Q1 on a separate zero-PT Vertex project (`gemini-3-flash-preview`, `location='global'` — Flex PayGo is preview-only):

| Setting | Result |
|---|---|
| `pt_then_flex` | `traffic_type='ON_DEMAND_FLEX'` ← Q1: single `Shared-Request-Type: flex` on zero-PT |
| `flex_only` | `traffic_type='ON_DEMAND_FLEX'` |
| `pt_only` | 429, PT quota exceeded ← zero-PT control ✓ |
Q2 (priority) isn't on this branch — @anatolec's #5094 already shows the same pattern: pt_then_priority → ON_DEMAND_PRIORITY on zero-PT.
Q3 (PT-quota spillover destination) still needs the Vertex team.
If Q3 spills to Flex/Priority: drop the carve-out in #5158, flex/priority → single Shared-Request-Type header, keep google_vertex_service_tier as the escape hatch (needs #5094 folded in first for the priority mapping). If it spills to plain on-demand: keep the carve-out — silent downgrade from priority is worse than requiring the explicit field on Vertex.
(Just clarifying that I don't feel I'm sufficiently knowledgeable on the subject to add anything meaningful to what has already been said)
* Fix naming convention comment
* Use Flex for Vertex
* Remove incorrectly supported Groq reference from docstring
Extends `GoogleVertexServiceTier` with `'pt_then_priority'` (PT with Priority PayGo spillover) and `'priority_only'` (Priority PayGo without PT), mirroring the existing Flex PayGo pair. Folds pydantic#5094 in so both PayGo tiers land together.
@markmcd @Mawox Thanks for working on this! Pushed four commits on top of 878db9e to consolidate the work across #5094 and #5158:
@markmcd Vertex top-level → priority-header mapping is still off pending your Q3 answer (PT-customer-over-quota spillover destination). Once confirmed safe, it's a one-line change to map `priority` → `X-Vertex-AI-LLM-Shared-Request-Type: priority` the same way `flex` is mapped.
…`service_tier` filter

- Bedrock and Google GLA now treat top-level `service_tier='auto'` as "omit from the request", matching the `ServiceTier` docstring's stated semantics. Both providers previously sent an explicit `'default'` / `'standard'` tier, which was functionally equivalent but prevented `'auto'` from acting as a clean override-to-unset for inherited settings.
- Cerebras: add `openai_service_tier` alongside the pre-existing `service_tier` entry in `openai_unsupported_model_settings`, so the per-provider field is also filtered out rather than forwarded to an API that doesn't accept it.
- Clarify in the `bedrock_service_tier` docstring that it is the only way to request `'reserved'` (which needs a pre-purchased capacity reservation).
…bump

Fixes three tests that failed on `main` after the SDK bump:

- File search snapshots now include the `file_search_store` field the 1.70 response payload adds for built-in file-search tool returns (`test_google_model_file_search_tool`, `_stream`).
- The streaming safety-filter test mock now pins `sdk_http_response=None` so the new `x-gemini-service-tier` header lookup on every chunk does not pull a `Mock` object into `provider_details` and break the later pydantic serialization in `ContentFilterError.body` (`test_google_stream_safety_filter`).
Covers the cross-provider `service_tier` → Anthropic request-value mapping and the `anthropic_service_tier` per-provider override:

- `'auto'` passes through (Anthropic accepts it natively)
- `'default'` maps to `'standard_only'`
- `'flex'` / `'priority'` are silently omitted (not supported by Anthropic)
- `anthropic_service_tier` wins over the top-level `service_tier`

Addresses the Devin Review finding on pydantic#4926 about missing Anthropic coverage parallel to the existing OpenAI/Google/Bedrock tests.
```python
service_tier = model_settings.get('anthropic_service_tier') or model_settings.get('service_tier')
if service_tier == 'default':
    service_tier = 'standard_only'
elif service_tier not in ('auto', 'standard_only'):
    service_tier = OMIT
```
Minor: when service_tier is not set at all (neither anthropic_service_tier nor service_tier), the or chain evaluates to None, and then None not in ('auto', 'standard_only') is True, so service_tier gets set to OMIT. This works but is subtle — it would be cleaner to guard with an early if service_tier is None: service_tier = OMIT before the mapping logic, for readability.
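The suggested restructuring could look like the sketch below, assuming a dict-shaped `model_settings` and using a stand-in `OMIT` sentinel in place of the Anthropic SDK's; the helper name is hypothetical:

```python
# Stand-in for the Anthropic SDK's omit sentinel, for illustration only.
OMIT = object()

def resolve_anthropic_service_tier(model_settings: dict):
    # Per-provider override wins, then the unified field.
    service_tier = model_settings.get('anthropic_service_tier') or model_settings.get('service_tier')
    if service_tier is None:
        return OMIT  # explicit early guard: nothing was set, omit the field
    if service_tier == 'default':
        return 'standard_only'  # cross-provider 'default' maps to Anthropic's explicit form
    if service_tier in ('auto', 'standard_only'):
        return service_tier  # accepted natively
    return OMIT  # 'flex' / 'priority' have no Anthropic equivalent
```

The early `None` guard makes the "unset means omit" case explicit instead of relying on `None not in (...)` evaluating to `True`.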
- `test_anthropic_service_tier_mapping`: restructure params so `AnthropicModelSettings` is constructed inside the test body. The previous parametrize decorator referenced it at module scope, which failed collection on the slim/lowest/pydantic-evals CI matrices that don't install the `anthropic` extra (NameError before `pytestmark` skipif could apply).
- Vertex logprobs snapshots: pick up the new `log_probability_sum: None` field the google-genai 1.70 response now exposes (was failing on `all-extras` matrices).
- Capability schema snapshot: pick up the new `service_tier` field on `ModelSettings` that this PR adds.
… Bedrock branch
- `_google_vertex_service_tier_headers` now takes `GoogleVertexServiceTier | ServiceTier`
and uses `assert_never` instead of a defensive `.lower()` + `return {}` fallback.
All callers already pass typed values; the stringly-typed shim + dead `'standard'`
branch were a carryover from earlier iterations and left coverage at 99.81%.
- Bedrock: drop the redundant `in ('default', 'flex', 'priority')` inner check.
`ServiceTier = Literal['auto', 'default', 'flex', 'priority']`, and the outer
branch already excludes `'auto'`, so the guard was unreachable.
… doc/test fixes
Addresses auto-review bot findings on the prior push:
- `google_service_tier` now emits a `DeprecationWarning` when consulted
(factored into `_get_deprecated_google_service_tier`, called from both the
Vertex header path and the GLA service-tier path). Adds a regression test.
- Restore `OpenRouter`, `Cerebras`, and `xAI` in the `thinking` docstring
'Supported by' list — dropped in the earlier consolidation, all three
support it through their OpenAI-based implementations.
- Bedrock docs: reflect the actual behavior that `service_tier='auto'` omits
the `serviceTier` field rather than sending `{'type': 'default'}`, and note
`'reserved'` is only reachable through `bedrock_service_tier`.
- Switch the Vertex-headers parametrize test + VCR tests to
`google_vertex_service_tier` so they don't emit the new deprecation warning.
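The warn-on-read shim described in the first bullet might look roughly like this. A hedged sketch only: the real helper is `_get_deprecated_google_service_tier` inside the model class, and this stand-alone version just illustrates the idea:

```python
import warnings

def get_deprecated_google_service_tier(model_settings: dict):
    # Hypothetical shape of the deprecation shim: consulting the old
    # field emits a DeprecationWarning pointing at the caller.
    value = model_settings.get('google_service_tier')
    if value is not None:
        warnings.warn(
            '`google_service_tier` is deprecated, use `google_vertex_service_tier` instead.',
            DeprecationWarning,
            stacklevel=2,  # points at the resolver's caller, stable across refactors
        )
    return value
```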
…thropic None-early-return

- Map top-level `service_tier='priority'` to `X-Vertex-AI-LLM-Shared-Request-Type: priority` on Vertex AI, symmetric with how `'flex'` already maps. Both stay single-header so Provisioned Throughput customers still use PT first; `google_vertex_service_tier='priority_only'` is the explicit escape hatch for anyone who wants to skip PT. Addresses the Devin finding about the `priority` vs. `flex` asymmetry and the auto-review bot note on `GoogleVertexServiceTier` parametrization; adds coverage for both `'flex'` and `'priority'`.
- Extract `_resolve_gla_service_tier` + `_resolve_vertex_service_tier` helpers so `_build_content_and_config` no longer needs `# noqa: C901` and each resolution is independently testable.
- Anthropic: swap the `or`-chain mapping for an early `None → OMIT` return for readability.
```python
elif (unified_tier := model_settings.get('service_tier')) and unified_tier != 'auto':
    params['serviceTier'] = {'type': unified_tier}
```
🚩 Bedrock unified service_tier='default' maps to {'type': 'default'} — verify this is valid
At bedrock.py:696-697, the unified service_tier='default' is wrapped as {'type': 'default'}. The ServiceTierTypeDef accepts Literal['default', 'flex', 'priority', 'reserved'], so 'default' should be valid. However, the Bedrock docs page linked from the docstring should be checked to confirm that 'default' is actually a meaningful tier value (as opposed to just being the absence of a tier selection). The test at test_bedrock.py:678-718 mocks the Bedrock client and verifies the dict structure but doesn't validate against the real API.
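The quoted branch's intended semantics (per-provider override first, then the unified value, with `'auto'` and unset both omitting the field) can be sketched as follows; the helper name is illustrative, not the repo's code:

```python
def bedrock_service_tier_params(model_settings: dict) -> dict:
    # Sketch of the Bedrock request-params branch under the stated rules.
    params: dict = {}
    if tier := model_settings.get('bedrock_service_tier'):
        # Per-provider override wins outright (only way to reach 'reserved').
        params['serviceTier'] = {'type': tier}
    elif (unified := model_settings.get('service_tier')) and unified != 'auto':
        # Unified value is wrapped; 'auto' (and unset) omit the field entirely.
        params['serviceTier'] = {'type': unified}
    return params
```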
```
`google_vertex_service_tier`) take precedence over this unified field.

Supported by:

* OpenAI
* Gemini
```
I might be completely off, but it seems strange that we mention "Gemini" here (I believe for GLA?) while `google_vertex_service_tier` appears above. I wonder if this bullet list should also mention Vertex (especially since Vertex can be used for non-Gemini models).

(Also, it looks like this docstring is duplicated from the TypeAlias above, which might lead to either copy getting out of sync.)
```python
then the top-level `service_tier`. Maps `'default'` → `'standard'`; drops any value
that isn't valid for GLA (including `'auto'`, which signals "let the server decide").
"""
raw = _get_deprecated_google_service_tier(model_settings) or model_settings.get('service_tier')
```
I'm afraid conflating the provider-specific service tier and the `model_settings` service tier is bound to create headaches: at some point, some provider is going to call their default mode "flex" or something like that.

Rather than putting either value in a variable and then handling a `GoogleVertexServiceTier | ServiceTier`, I think it's much much saner to:

- See if a provider-specific (in this case `GoogleVertexServiceTier`) value is defined. If so, use it.
- If not, map the `ServiceTier` to a `GoogleVertexServiceTier`.
- Always handle a `GoogleVertexServiceTier`.

(Here I'm saying this for Vertex, but that would be the way we handle it for every other provider.)

As the Zen of Python says: "In the face of ambiguity, refuse the temptation to guess."
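The three-step resolution described here can be sketched as below. The type aliases and lookup-table name mirror the ones discussed in this thread, but the function itself is an illustration, not the PR's code:

```python
from typing import Literal, Optional

ServiceTier = Literal['auto', 'default', 'flex', 'priority']
GoogleVertexServiceTier = Literal[
    'pt_then_on_demand', 'pt_only', 'pt_then_flex',
    'on_demand', 'flex_only', 'pt_then_priority', 'priority_only',
]

# Cross-provider fallback lookup; 'auto' / 'default' are intentionally
# absent (they resolve to None, i.e. no Vertex routing headers).
_TOP_LEVEL_TO_VERTEX_SERVICE_TIER = {
    'flex': 'pt_then_flex',
    'priority': 'pt_then_priority',
}

def resolve_vertex_service_tier(model_settings: dict) -> Optional[str]:
    # 1. A provider-specific value, when set, wins outright.
    specific = model_settings.get('google_vertex_service_tier')
    if specific is not None:
        return specific
    # 2. Otherwise map the unified ServiceTier to a GoogleVertexServiceTier.
    # 3. Downstream code then only ever handles GoogleVertexServiceTier values.
    return _TOP_LEVEL_TO_VERTEX_SERVICE_TIER.get(model_settings.get('service_tier'))
```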
…m docstring duplication

Per @ewjoachim's review on pydantic#4926: separate the cross-provider mapping from the Vertex-headers helper so the helper is purely about Vertex routing, and avoid future ambiguity if another provider's tier values ever collide with the top-level `ServiceTier` literals.

- `_resolve_vertex_service_tier` now resolves to `GoogleVertexServiceTier` directly, using a `_TOP_LEVEL_TO_VERTEX_SERVICE_TIER` lookup for the cross-provider fallback (`'flex'` → `'pt_then_flex'`, `'priority'` → `'pt_then_priority'`, etc.).
- `_google_vertex_service_tier_headers` parameter is back to a strict `GoogleVertexServiceTier` Literal, with no provider-cross-mapping branches.
- `ServiceTier` TypeAlias docstring slimmed to value semantics only; `ModelSettings.service_tier` field now points at the alias instead of repeating the value list, and lists "Google (Gemini API and Vertex AI)" rather than just "Gemini" (the unified field maps on both Google subsystems now).
…ence; Vertex unified→header detail

Addresses the "can a reader figure out how this works from the docs" gap and the auto-review bot's stacklevel feedback.

- `ServiceTier` TypeAlias docstring now carries the canonical cross-provider mapping table and the precedence rule (per-provider field wins). The `ModelSettings.service_tier` field defers to the alias for value semantics to keep the two from drifting.
- `docs/models/google.md`: spell out the unified→Vertex header mapping (`'flex'` → `Shared-Request-Type: flex`, `'priority'` → `Shared-Request-Type: priority`, `'auto'`/`'default'` → no headers, all PT-with-spillover) instead of "this sets the default routing behavior", with a note that bypassing PT requires the per-provider field.
- `docs/models/openai.md`/`anthropic.md`/`bedrock.md`: short precedence sentence + cleaner mapping description on each.
- Drop deprecation warning `stacklevel` from 3 to 2 — points at the resolver rather than the now-unhelpful `_build_content_and_config` caller, and is stable across refactors of the request-build pipeline.
All issues referenced by this PR are already closed. If you believe an issue should be reopened, please comment on it first. |
service_tier model setting
service_tier model setting → service_tier model setting; Anthropic + Gemini API + Vertex Priority PayGo support
```
# Conflicts:
#	pydantic_ai_slim/pydantic_ai/models/anthropic.py
```
Adds `_resolve_openai_service_tier` / `_resolve_anthropic_service_tier` helpers that check the provider-specific override first, then map the unified `service_tier` to a strictly-typed provider value. Mirrors the Bedrock + Vertex shape so all four providers handle the unified field the same way.
Stops conflating the deprecated `google_service_tier` alias with the unified `service_tier` in `_resolve_gla_service_tier`: each is mapped through the same lookup table independently, with the alias winning when set. Tightens the return type to `Literal['standard', 'flex', 'priority'] | None` to mirror the other provider helpers.
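A hedged sketch of that independent-lookup shape: the table contents follow the mapping discussed in this thread (`'default'` → `'standard'`, invalid values dropped), and the alias-wins behavior means an alias value with no GLA equivalent resolves to `None` rather than falling through to the unified field. Names are illustrative, not the repo's code:

```python
from typing import Optional

# Unified -> GLA value lookup; 'auto' and any Vertex-shaped alias
# values are absent, so they resolve to None (omitted from the request).
_GLA_VALUE_MAP = {'default': 'standard', 'flex': 'flex', 'priority': 'priority'}

def resolve_gla_service_tier(model_settings: dict) -> Optional[str]:
    # Deprecated alias wins when set; each field is mapped through the
    # same lookup independently rather than conflated into one variable.
    for key in ('google_service_tier', 'service_tier'):
        value = model_settings.get(key)
        if value is not None:
            return _GLA_VALUE_MAP.get(value)
    return None
```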
```python
'presence_penalty',
'parallel_tool_calls',
'service_tier',
'openai_service_tier',
)
```
🚩 Other OpenAI-compatible providers (Groq, xAI, OpenRouter) will silently pass service_tier through to the API
Only Cerebras explicitly marks service_tier and openai_service_tier as unsupported via openai_unsupported_model_settings. Other OpenAI-compatible providers (Groq, xAI, OpenRouter) don't declare these as unsupported, so the _resolve_openai_service_tier function will resolve the unified service_tier and pass it through to those APIs. Whether this is a bug depends on whether those providers support or ignore the service_tier parameter — OpenAI-compatible APIs generally ignore unknown parameters, so this is likely harmless, but it's worth verifying.
(Refers to lines 64-71)
OK to leave as is, I believe.
`GoogleServiceTier` only contains Vertex-shaped values (`pt_then_*`, `*_only`), so on GLA every alias value falls through the `_GLA_VALUE_MAP` lookup. Replace the dead branch with a single `_get_deprecated_google_service_tier()` call that preserves the deprecation warning while keeping the resolver branch-coverable.
… API + Vertex Priority PayGo support (pydantic#4926)

Co-authored-by: Anatole Callies <[email protected]>
Co-authored-by: Douwe Maan <[email protected]>
Co-authored-by: Mark McDonnell <[email protected]>
Summary
Supersedes the original Vertex-only `google_service_tier` design and consolidates the cross-provider service-tier work.

Adds a unified [`service_tier`][pydantic_ai.settings.ModelSettings.service_tier] field on `ModelSettings`, mapped to each provider's native service-tier concept where one exists. Provider-specific overrides remain available for values that don't fit the unified set.

This PR consolidates earlier exploration work in #5158 (closed, by @Mawox) and #5094 (closed, by @anatolec — Priority PayGo on Vertex). Their commits are preserved in this branch's history.
Cross-provider mapping
`service_tier` accepts `'auto' | 'default' | 'flex' | 'priority'`:

| Unified value | OpenAI | Anthropic | Bedrock | Gemini API | Vertex AI |
|---|---|---|---|---|---|
| `'auto'` | `'auto'` | `'auto'` | (omitted) | (omitted) | (no headers) |
| `'default'` | `'default'` | `'standard_only'` | `{'type': 'default'}` | `'standard'` | (no headers) |
| `'flex'` | `'flex'` | (omitted) | `{'type': 'flex'}` | `'flex'` | `Shared-Request-Type: flex` (PT then Flex PayGo) |
| `'priority'` | `'priority'` | (omitted) | `{'type': 'priority'}` | `'priority'` | `Shared-Request-Type: priority` (PT then Priority PayGo) |

Per-provider settings (`openai_service_tier`, `anthropic_service_tier`, `bedrock_service_tier`, `google_vertex_service_tier`) always take precedence over the unified field, and they're the only way to reach values that aren't in the unified set: Bedrock's `'reserved'`, Anthropic's `'standard_only'` explicit form, and Vertex's full PT-routing matrix (`'pt_only'`, `'on_demand'`, `'flex_only'`, `'priority_only'`, etc.).

`'auto'` vs `'default'` distinction: `'auto'` lets the provider decide and may include premium tiers when available (matters for OpenAI's scale credits and Anthropic's priority capacity). `'default'` explicitly opts out of those promotions. On Bedrock / Google they're functionally equivalent today, but encoded forward-compatibly through the omit-vs-explicit wire choice.

Vertex AI design choice
The unified `'flex'` and `'priority'` map to the PT-with-spillover variants (single `Shared-Request-Type` header, no `Request-Type: shared`), so Vertex customers with Provisioned Throughput keep using their reserved capacity first. To bypass PT entirely, set `google_vertex_service_tier='flex_only'` / `'priority_only'` directly.

Open question with Google for confirmation: when a PT customer exceeds quota with the single-header form, does spillover land in Flex/Priority or in standard PayGo? Empirically (Mawox's reproduction on a zero-PT project, anatolec's #5094 live test) the headers fall through safely; the PT-customer-over-quota case is the only path not yet experimentally confirmed.

Other behavior changes
- `google_service_tier` (the original Vertex-only field) is deprecated in favor of `google_vertex_service_tier`. Reading it emits a `DeprecationWarning`. The values are unchanged.
- New `'pt_then_priority'` and `'priority_only'` Vertex routing values (from "Implement support for Priority PayGo with VertexAI" #5094).
- `openai_service_tier` added to the unsupported-settings filter (latent bug — the per-provider field was being forwarded to an API that doesn't accept it).
- `google-genai` bumped to `>=1.70.0` for the SDK's new `ServiceTier` enum on the Gemini API.

Test plan
- `DeprecationWarning` regression test for `google_service_tier`.
- Coverage for `google_service_tier='default'` / `'standard'` / `'flex'` / `'priority'`.