fix(citation): use position-based insertion for Gemini grounding supports #13646

Merged
DeJeune merged 3 commits into main from
DeJeune/fix-citation-tags
Mar 22, 2026

@DeJeune DeJeune commented Mar 19, 2026

What this PR does

Before this PR:
Gemini groundingSupports segments with very short text (e.g. ** for markdown bold markers) caused all matching occurrences in the content to receive citation tags, resulting in dense, broken rendering and significant UI lag.

After this PR:
Citation tags are inserted at the exact position indicated by the segment's startIndex/endIndex, preventing over-matching on short or repeated text patterns.

Fixes #8880

Why we need it and why it was done in this way

The Gemini API returns groundingSupports with a segment object containing both positional data (startIndex/endIndex) and the matched text. The previous implementation used text-based regex replacement, which is unreliable when the segment text is very short or common (e.g. **). Using the position indices directly is more precise and aligns with how the API intends the data to be consumed.
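To make the difference concrete, here is a minimal sketch contrasting the two strategies. It is not the actual `normalizeCitationMarks` code: the `Support` shape, the function names, and the 1-based `[cite:N]` numbering are assumptions for illustration, and `endIndex` is treated as a plain character offset, which only holds for ASCII content.

```typescript
// Hypothetical, simplified shape of one groundingSupports entry (an assumption).
interface Support {
  segment: { startIndex?: number; endIndex: number; text: string }
  groundingChunkIndices: number[]
}

// Escape regex metacharacters so the segment text is matched literally.
const escapeRegex = (s: string): string => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')

// Build the citation marks for one support; 1-based numbering is assumed here.
const citeTags = (s: Support): string => s.groundingChunkIndices.map((i) => `[cite:${i + 1}]`).join('')

// Text-based strategy: every occurrence of the segment text receives the tags.
function insertByText(content: string, supports: Support[]): string {
  let result = content
  for (const s of supports) {
    const tags = citeTags(s)
    result = result.replace(new RegExp(escapeRegex(s.segment.text), 'g'), (m) => m + tags)
  }
  return result
}

// Position-based strategy: insert once, at endIndex. Supports are processed
// from the last position to the first so earlier offsets stay valid.
function insertByPosition(content: string, supports: Support[]): string {
  const sorted = [...supports].sort((a, b) => b.segment.endIndex - a.segment.endIndex)
  let result = content
  for (const s of sorted) {
    const at = Math.max(0, Math.min(s.segment.endIndex, result.length))
    result = result.slice(0, at) + citeTags(s) + result.slice(at)
  }
  return result
}

const content = '**Alpha** and **Beta**'
const supports: Support[] = [{ segment: { startIndex: 0, endIndex: 2, text: '**' }, groundingChunkIndices: [0] }]

console.log(insertByText(content, supports)) // all four "**" markers are tagged
console.log(insertByPosition(content, supports)) // only the first marker is tagged
```

With a segment whose text is just `**`, the text-based version tags all four bold markers, while the position-based version inserts a single mark after the first one.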

The following tradeoffs were made:

  • Position-based insertion means we no longer do fuzzy text matching for Gemini citations. This is intentional since the API provides exact positions.

The following alternatives were considered:

  • Filtering out very short segments — rejected because the positions are still valid and useful.

Breaking changes

None.

Special notes for your reviewer

  • Only the WEB_SEARCH_SOURCE.GEMINI case in normalizeCitationMarks is changed. All other source types remain unchanged.
  • A new test case reproduces the exact bug scenario from the issue (short ** text matching all bold markers).

Checklist

Release note

Fixed Gemini citation over-matching caused by short text segments (e.g. `**`) in groundingSupports, which previously added citation tags to every bold marker in the response.

…orts

Gemini API can return groundingSupports with very short text segments
(e.g. "**") which caused all matching occurrences in the content to
receive citation tags. Switch from text-based regex replacement to
position-based insertion using startIndex/endIndex from the segment
metadata to fix over-matching.

Fixes #8880

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: suyao <[email protected]>

@DeJeune DeJeune requested a review from EurFelux March 19, 2026 15:33

Gemini API's groundingSupports segment endIndex is a UTF-8 byte offset,
not a JS character offset. CJK characters are 3 bytes in UTF-8 but
1 character in JavaScript, causing citations to be inserted at wrong
positions for non-ASCII content.

Use TextEncoder/TextDecoder to convert byte offsets to character offsets
before slicing the content string.

Fixes #8880

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: suyao <[email protected]>
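
The byte-to-character conversion described in the commit above can be sketched as follows. This is an illustrative helper, not the code from the PR: the name `byteOffsetToCharOffset` is hypothetical, and the sketch assumes the byte offset falls on a character boundary (an offset in the middle of a multi-byte character would decode to a replacement character and shift the result).

```typescript
// Convert a UTF-8 byte offset into a JavaScript string offset (UTF-16 code units).
// Hypothetical helper for illustration; assumes byteOffset lands on a character boundary.
function byteOffsetToCharOffset(content: string, byteOffset: number): number {
  const bytes = new TextEncoder().encode(content)
  // Clamp so out-of-range offsets map to the start or end of the string.
  const clamped = Math.max(0, Math.min(byteOffset, bytes.length))
  // Decode only the prefix; its length is the matching string offset.
  return new TextDecoder().decode(bytes.subarray(0, clamped)).length
}

const text = '你好 world'
// "你" and "好" take 3 bytes each in UTF-8 but 1 code unit each in a JS string.
const charOffset = byteOffsetToCharOffset(text, 6)
console.log(text.slice(0, charOffset)) // prints "你好"
```

Slicing the string with the raw byte offset 6 would instead return `你好 wor`, which is the kind of misplaced insertion the commit describes for CJK content.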

@cherry-ai-bot cherry-ai-bot bot left a comment

I re-checked the fix and I think the direction is correct.

The old implementation was fundamentally unsafe for Gemini grounding because it treated segment.text as a global replacement key, which explodes on very short or repeated fragments like **. Switching to endIndex-based insertion matches the API contract much better and avoids the pathological “tag every matching token in the whole response” behavior that caused broken rendering and lag.

I also checked the updated tests: besides adapting the existing Gemini cases to positional metadata, the PR now includes a direct regression test for issue #8880 and a CJK/UTF-8 offset test, which are exactly the two places this kind of fix is most likely to go wrong. I did not find a new correctness blocker in the updated logic. The remaining risk is mostly whether Gemini always reports stable offsets in more exotic markdown/content mixes, but that feels like normal follow-up risk rather than a reason to block this patch.

Add a fixture file with real groundingChunks/groundingSupports data from
a Gemini 3 Pro response, and snapshot tests that verify:
- normalizeCitationMarks inserts [cite:N] at correct positions
- withCitationTags produces correct final HTML output
- No over-matching occurs (exactly 15 total cite references)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: suyao <[email protected]>
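
A check for the "no over-matching" property in the commit above could look roughly like this. It is a sketch, not the PR's snapshot test: `countCiteMarks` is a hypothetical helper, and it assumes marks have the `[cite:N]` form with a decimal `N`.

```typescript
// Count [cite:N] marks in normalized content (hypothetical helper, assumes decimal N).
function countCiteMarks(text: string): number {
  return (text.match(/\[cite:\d+\]/g) ?? []).length
}

const normalized = 'First claim.[cite:1] Second claim.[cite:2][cite:3]'
console.log(countCiteMarks(normalized)) // prints 3
```

Asserting an exact total against a real fixture catches over-matching directly, because every spurious insertion raises the count.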

@DeJeune DeJeune requested a review from alephpiece March 22, 2026 10:38


@EurFelux EurFelux left a comment

LGTM! The position-based insertion approach is correct and well-tested. Good coverage for the UTF-8 byte offset edge case with CJK characters.

@DeJeune DeJeune merged commit 6b7d72e into main Mar 22, 2026
7 checks passed
@DeJeune DeJeune deleted the DeJeune/fix-citation-tags branch March 22, 2026 15:31

Development

Successfully merging this pull request may close these issues.

[Bug]: Citation overmatching caused by short gemini text segments from groundingSupports