fix(citation): use position-based insertion for Gemini grounding supports #13646
…orts

Gemini API can return groundingSupports with very short text segments (e.g. "**") which caused all matching occurrences in the content to receive citation tags. Switch from text-based regex replacement to position-based insertion using startIndex/endIndex from the segment metadata to fix over-matching.

Fixes #8880

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: suyao <[email protected]>
Gemini API's groundingSupports segment endIndex is a UTF-8 byte offset, not a JS character offset. CJK characters are 3 bytes in UTF-8 but 1 character in JavaScript, causing citations to be inserted at wrong positions for non-ASCII content. Use TextEncoder/TextDecoder to convert byte offsets to character offsets before slicing the content string.

Fixes #8880

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: suyao <[email protected]>
I re-checked the fix and I think the direction is correct.
The old implementation was fundamentally unsafe for Gemini grounding because it treated segment.text as a global replacement key, which explodes on very short or repeated fragments like **. Switching to endIndex-based insertion matches the API contract much better and avoids the pathological “tag every matching token in the whole response” behavior that caused broken rendering and lag.
I also checked the updated tests: besides adapting the existing Gemini cases to positional metadata, the PR now includes a direct regression test for issue #8880 and a CJK/UTF-8 offset test, which are exactly the two places this kind of fix is most likely to go wrong. I did not find a new correctness blocker in the updated logic. The remaining risk is mostly whether Gemini always reports stable offsets in more exotic markdown/content mixes, but that feels like normal follow-up risk rather than a reason to block this patch.
Add a fixture file with real groundingChunks/groundingSupports data from a Gemini 3 Pro response, and snapshot tests that verify:
- normalizeCitationMarks inserts [cite:N] at correct positions
- withCitationTags produces correct final HTML output
- No over-matching occurs (exactly 15 total cite references)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: suyao <[email protected]>
EurFelux left a comment
LGTM! The position-based insertion approach is correct and well-tested. Good coverage for the UTF-8 byte offset edge case with CJK characters.
What this PR does
Before this PR:
Gemini groundingSupports segments with very short text (e.g. ** for markdown bold markers) caused all matching occurrences in the content to receive citation tags, resulting in dense, broken rendering and significant UI lag.
After this PR:
Citation tags are inserted at the exact position indicated by the segment's startIndex/endIndex, preventing over-matching on short or repeated text patterns.
Fixes #8880
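The difference between the two behaviors can be illustrated with a small sketch. Everything here is an assumption for illustration: the Support shape, the function names replaceByText and insertByPosition, and the idea that endIndex has already been converted to a JS string offset. It is not the implementation from this PR.

```typescript
// Hypothetical, simplified shape; endIndex is assumed to already be a JS string offset.
interface Support {
  endIndex: number
  citationNumbers: number[]
}

// Text-based approach: every occurrence of the segment text receives the tag.
function replaceByText(content: string, text: string, tag: string): string {
  const escaped = text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
  return content.replace(new RegExp(escaped, 'g'), (match) => match + tag)
}

// Position-based approach: one tag group per support, inserted at its endIndex.
function insertByPosition(content: string, supports: Support[]): string {
  // Walk from the largest offset down so earlier offsets are not shifted by insertions.
  const sorted = [...supports].sort((a, b) => b.endIndex - a.endIndex)
  let result = content
  for (const support of sorted) {
    const pos = Math.max(0, Math.min(support.endIndex, result.length))
    const tag = support.citationNumbers.map((n) => `[cite:${n}]`).join('')
    result = result.slice(0, pos) + tag + result.slice(pos)
  }
  return result
}

const content = '**Bold** and **more**'
// Over-matching: all four "**" markers are tagged.
console.log(replaceByText(content, '**', '[cite:1]'))
// Position-based: exactly one tag, placed at offset 2.
console.log(insertByPosition(content, [{ endIndex: 2, citationNumbers: [1] }]))
```

Inserting from the highest offset downward keeps the remaining offsets valid without tracking a running shift, which is why the supports are sorted before the loop.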
Why we need it and why it was done in this way
The Gemini API returns groundingSupports with a segment object containing both positional data (startIndex/endIndex) and the matched text. The previous implementation used text-based regex replacement, which is unreliable when the segment text is very short or common (e.g. **). Using the position indices directly is more precise and aligns with how the API intends the data to be consumed.
The following tradeoffs were made:
The following alternatives were considered:
Breaking changes
None.
Special notes for your reviewer
WEB_SEARCH_SOURCE.GEMINI case in normalizeCitationMarks is changed. All other source types remain unchanged.
** text matching all bold markers).
Checklist
/gh-pr-review, gh pr diff, or GitHub UI) before requesting review from others
Release note