fix: strip titles from bare-metadata nodes (Gemini 2.5 Flash)#3881
fix: strip titles from bare-metadata nodes (Gemini 2.5 Flash)#3881
Conversation
Two follow-ups to #3861's Gemini 2.5 Flash title-stripping fix: 1. Pydantic emits bare {"title": "X"} schemas for Any-typed fields with no sibling type/properties. The original heuristic required a schema keyword before stripping, so those bare nodes leaked through and Gemini still hit MALFORMED_FUNCTION_CALL. Now we also strip when every remaining key is a pure metadata key (title/description/deprecated/readOnly/writeOnly). 2. The traversal no longer recurses into default, const, examples, or enum. Those hold literal user values — a user default of {"title": "X", "type": "Y"} was at risk of having its keys interpreted as schema metadata.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8b2a8db49e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # Keywords whose values are literal user data, not sub-schemas. Skipping | ||
| # recursion here prevents `default: {"title": "X"}` from losing the "title" | ||
| # data value because it happens to look metadata-shaped. | ||
| _LITERAL_KEYWORDS = frozenset({"default", "const", "examples", "enum"}) |
There was a problem hiding this comment.
Include
example in literal-keyword skip list
The new traversal guard skips literal payload keywords via _LITERAL_KEYWORDS, but omitting example means prune_titles=True still descends into example values and strips user data fields named title. In practice, schemas produced with Pydantic json_schema_extra={"example": {"title": "Dashboard", ...}} now have that example mutated during compression, which is a regression from the previous behavior and contradicts the goal of preserving literal values.
Useful? React with 👍 / 👎.
Two follow-ups after the first round:
1. Skipping recursion on literal-keyword keys broke schemas where a user
property is literally named 'enum' or 'default' — the traversal stopped
at the property name and never reached the sub-schema's $ref, so
referenced $defs got pruned as unused (AssertionError in
test_all_primitive_field_types). Split arbitrary-key dicts (properties,
patternProperties, $defs, definitions, dependentSchemas) into their
own branch so their values are always traversed as sub-schemas
regardless of key name.
2. Added 'example' (OpenAPI/Swagger 2.0 singular) alongside 'examples'
(JSON Schema draft 7+) to the literal-keyword list. Codex flagged that
Pydantic's json_schema_extra={'example': {'title': '...'}} was still
getting its literal user data stripped.
3. Updated test_lambda snapshot — the whole point of this PR is that bare
{'title': 'X'} nodes (Pydantic Any fields) now get pruned, so the
snapshot's {'x': {'title': 'X'}} is now {'x': {}}.
Test Failure AnalysisSummary: The test Root Cause: In _LITERAL_KEYWORDS = frozenset({"default", "const", "examples", "enum"})and then skips recursion into any dict key matching that set: if key in _LITERAL_KEYWORDS:
continueThe problem: this check runs while iterating over every dict — including a schema's "enum": {"$ref": "#/$defs/DataEnum"}When Suggested Solution: In for key, value in node.items():
if skip_defs_section and key == "$defs":
continue
# Properties dict keys are user-defined property names, not schema keywords.
# Recurse into each property sub-schema directly, bypassing keyword checks.
if key == "properties" and isinstance(value, dict):
for prop_schema in value.values():
traverse_and_clean(prop_schema, current_def_name, depth=depth + 1)
continue
# Don't descend into keywords that carry literal data, not sub-schemas
if key in _LITERAL_KEYWORDS:
continue
if key in ["allOf", "oneOf", "anyOf"] and isinstance(value, list):
...The same bug would affect properties named Detailed AnalysisFailing test: Error (from job logs): The schema reaching the validator has a Trace through the new code (
Jobs that failed:
Related Files
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 055f50ea45
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| and isinstance(node["title"], str) # type: ignore | ||
| and ( | ||
| any(k in node for k in _SCHEMA_KEYWORDS) | ||
| or all(k in _METADATA_KEYS for k in node) |
There was a problem hiding this comment.
Limit metadata-only title stripping to schema contexts
The new all(k in _METADATA_KEYS for k in node) branch removes title from any dict that happens to have metadata-like keys, even when that dict is user extension payload rather than a schema node. For example, with json_schema_extra={"x-ui": {"title": "Dashboard", "description": "..."}}, compress_schema(..., prune_titles=True) now mutates x-ui.title to nothing, which is a regression introduced by this commit (previous logic required structural schema keywords before pruning).
Useful? React with 👍 / 👎.
Codex flagged that the bare-metadata title-strip heuristic fires on any
dict whose keys are all metadata — including user extension payloads like
json_schema_extra={'x-ui': {'title': 'Dashboard', 'description': '...'}},
where the payload is opaque to us but happens to look metadata-shaped.
Fix: thread an in_schema flag through traverse_and_clean. We only apply
title/additionalProperties cleanups when we arrived via a known schema
position (root, properties/$defs/patternProperties values, items,
allOf/anyOf/oneOf/prefixItems elements, or single-schema keywords like
if/then/else/contains/propertyNames). Unknown keys (vendor extensions,
arbitrary user dicts) still get traversed for $ref collection, but
in_schema=False so no cleanups run inside them.
Also classified the schema-valued keywords explicitly (SUBSCHEMA_VALUE_KEYS
and SUBSCHEMA_LIST_KEYS) to route traversal correctly.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c1d2a27594
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # Unknown keys (user extensions like `x-ui`, vendor | ||
| # metadata, etc.) — descend for $ref collection but mark | ||
| # in_schema=False so cleanups don't touch user payloads. | ||
| traverse_and_clean( | ||
| value, current_def_name, depth=depth + 1, in_schema=False |
There was a problem hiding this comment.
Traverse unlisted schema keywords in schema mode
The fallback path now recurses unknown keys with in_schema=False, which means any valid JSON Schema keyword missing from the hardcoded sets (for example draft-07 dependencies and additionalItems) is treated as opaque payload instead of a subschema. In those branches, compress_schema(..., prune_titles=True) and prune_additional_properties=True stop applying cleanups, whereas the parent implementation still cleaned them, so this introduces a regression for schemas that rely on those standard keywords.
Useful? React with 👍 / 👎.
Codex flagged that unlisted sub-schema keywords now fall into the in_schema=False branch and skip cleanups entirely. Add the missing standard keywords: - dependencies (draft-07): values may be sub-schemas (list values short-circuit in the list branch, so it's safe in _SUBSCHEMA_MAP_KEYS) - additionalItems (draft-07): single sub-schema, precursor to unevaluatedItems - contentSchema (2019-09+): single sub-schema for typed string payloads
Follow-up to #3861.
Two edge cases Codex flagged on the original Gemini title-stripping PR that were left open:
Bare-metadata nodes leak through. Pydantic emits
{"title": "X"}forAny-typed fields — notype, noproperties, nothing to make the heuristic fire. Gemini 2.5 Flash rejects those withMALFORMED_FUNCTION_CALL, meaning the original fix didn't actually resolve the bug for Pydantic schemas that include anyAny-typed field. The heuristic now stripstitlewhen every remaining key is a pure JSON Schema metadata keyword (title,description,deprecated,readOnly,writeOnly), which covers the Pydantic-Any case without touching user property dicts.Literal values get corrupted. The traversal recursed into
default,const,examples, andenumunconditionally. A user default shaped like{"title": "My Dashboard", "type": "vis"}could silently have its"title"field interpreted as schema metadata and stripped. These keywords carry literal user data, not sub-schemas, so the traversal now skips them.The "property literally named
title" case Codex raised is already handled correctly by the existingisinstance(title, str)guard — whenproperties["title"]is a dict (a sub-schema), it isn't stripped. Covered by an existing regression test (test_title_pruning_preserves_parameter_named_title).