Skip to content

[None][fix] Fix chunked prefill API contract for nemotron nano VL#13025

Merged
2ez4bz merged 1 commit into
NVIDIA:mainfrom
2ez4bz:dev-nano-v3-fix-chunked-prefill-api
Apr 17, 2026
Merged

[None][fix] Fix chunked prefill API contract for nemotron nano VL#13025
2ez4bz merged 1 commit into
NVIDIA:mainfrom
2ez4bz:dev-nano-v3-fix-chunked-prefill-api

Conversation

@2ez4bz
Copy link
Copy Markdown
Collaborator

@2ez4bz 2ez4bz commented Apr 14, 2026

Summary by CodeRabbit

  • Improvements

    • Enhanced diagnostic logging for multimodal embedding handling with more detailed information about returned tensors, embedding types, and token counts.
  • Tests

    • Added comprehensive tests validating multimodal embedding caching behavior, encoder contract compliance, and chunked prefill compatibility.

Description

  • Why?

In order to opt into the caching functionality for chunked prefix, there are certain assumptions on the return type of the encoder's forward function. These assumptions did not hold for nemotron nano VL prior to this commit.

  • What?

This commit fixes this issue, and adds tests to catch regressions.

Test Coverage

See above.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@2ez4bz 2ez4bz requested a review from a team as a code owner April 14, 2026 06:02
@2ez4bz 2ez4bz requested a review from symphonylyh April 14, 2026 06:02
@2ez4bz 2ez4bz force-pushed the dev-nano-v3-fix-chunked-prefill-api branch 2 times, most recently from f4313d5 to 35608fe Compare April 14, 2026 06:09
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 14, 2026

Caution

Review failed

An error occurred during the review process. Please try again later.

📝 Walkthrough

Walkthrough

These changes refactor multimodal embedding handling in Nemotron Nano models. The _encode_multimodal method now returns only a list of concatenated tensors instead of a tuple with token counts; token counts are stored directly in multimodal parameters. The forward method reconstructs token counts from context parameters when needed. Enhanced logging provides more detailed warning information for debugging.

Changes

Cohort / File(s) Summary
Logging Enhancement
tensorrt_llm/_torch/models/modeling_multimodal_utils.py
Enhanced logger.warning message to include number of returned embeddings, embedding type names, shapes, count of uncached multimodal params, and encoder function reference when encoder_forward_fn returns multiple tensors.
Core API Changes
tensorrt_llm/_torch/models/modeling_nemotron_nano.py
Modified _encode_multimodal to return List[torch.Tensor] instead of tuple with token counts. Stores per-request num_tokens_in_video in MultimodalParams.multimodal_data. Updated forward to reconstruct num_tokens_in_videos from context params. Changed merge_evs_mm_embeds invocation to use sliced context params.
Test Updates
tests/unittest/_torch/modeling/test_modeling_nemotron_nano_v2_vl.py
Added imports and comprehensive tests for multimodal caching contract. Updated existing dispatch tests for new return signature. Introduced TestEncodeMultimodalContract validating list-of-single-tensor return, and TestChunkedPrefillCaching verifying encoder caching behavior.

Sequence Diagram(s)

sequenceDiagram
    participant Forward as Forward()
    participant Encoder as _encode_multimodal()
    participant Cache as MultimodalParams<br/>(Cache Storage)
    participant Merge as merge_evs_mm_embeds()
    participant Context as Context Params

    Forward->>Encoder: Call with uncached params
    Encoder->>Encoder: Vision/Audio encoding
    Encoder->>Cache: Store embeddings in<br/>multimodal_data
    Encoder->>Cache: Store num_tokens_in_video
    Encoder-->>Forward: Return List[Tensor]
    Forward->>Merge: Pass merged embeddings<br/>+ context params
    Merge-->>Forward: Merged result
    Forward->>Context: Reconstruct num_tokens<br/>from cached values
    Forward-->>Forward: Continue with<br/>merged embeddings
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: fixing the chunked prefill API contract for nemotron nano VL, which aligns with the primary objective and code modifications across all three files.
Description check ✅ Passed The description adequately explains the issue (assumptions about encoder return type not holding for nemotron nano VL), the solution (fixing the API contract and adding tests), and provides test coverage details.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@2ez4bz 2ez4bz force-pushed the dev-nano-v3-fix-chunked-prefill-api branch from 35608fe to e7868dd Compare April 14, 2026 16:52
@2ez4bz
Copy link
Copy Markdown
Collaborator Author

2ez4bz commented Apr 14, 2026

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43265 [ run ] triggered by Bot. Commit: e7868dd Link to invocation

@2ez4bz 2ez4bz enabled auto-merge (squash) April 14, 2026 17:54
@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43265 [ run ] completed with state SUCCESS. Commit: e7868dd
/LLM/main/L0_MergeRequest_PR pipeline #33818 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@2ez4bz
Copy link
Copy Markdown
Collaborator Author

2ez4bz commented Apr 15, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43372 [ run ] triggered by Bot. Commit: e7868dd Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43372 [ run ] completed with state SUCCESS. Commit: e7868dd
/LLM/main/L0_MergeRequest_PR pipeline #33907 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@Wanli-Jiang
Copy link
Copy Markdown
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43443 [ run ] triggered by Bot. Commit: e7868dd Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43443 [ run ] completed with state SUCCESS. Commit: e7868dd
/LLM/main/L0_MergeRequest_PR pipeline #33971 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@Wanli-Jiang
Copy link
Copy Markdown
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43486 [ run ] triggered by Bot. Commit: e7868dd Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43486 [ run ] completed with state FAILURE. Commit: e7868dd
/LLM/main/L0_MergeRequest_PR pipeline #34006 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@2ez4bz
Copy link
Copy Markdown
Collaborator Author

2ez4bz commented Apr 15, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43540 [ run ] triggered by Bot. Commit: e7868dd Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43540 [ run ] completed with state SUCCESS. Commit: e7868dd
/LLM/main/L0_MergeRequest_PR pipeline #34045 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@2ez4bz
Copy link
Copy Markdown
Collaborator Author

2ez4bz commented Apr 15, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43556 [ run ] triggered by Bot. Commit: e7868dd Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43556 [ run ] completed with state SUCCESS. Commit: e7868dd
/LLM/main/L0_MergeRequest_PR pipeline #34056 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

* Why?

In order to opt into the caching functionality for chunked prefix, there
are certain assumptions on the return type of the encoder's forward
function. These assumptions did not hold for nemotron nano VL prior to
this commit.

* What?

This commit fixes this issue, and adds tests to catch regressions.

Signed-off-by: William Zhang <[email protected]>
@2ez4bz 2ez4bz force-pushed the dev-nano-v3-fix-chunked-prefill-api branch from e7868dd to 1caf386 Compare April 15, 2026 19:49
@2ez4bz
Copy link
Copy Markdown
Collaborator Author

2ez4bz commented Apr 15, 2026

/bot run --disable-fail-fast

@2ez4bz
Copy link
Copy Markdown
Collaborator Author

2ez4bz commented Apr 16, 2026

/bot run

1 similar comment
@2ez4bz
Copy link
Copy Markdown
Collaborator Author

2ez4bz commented Apr 16, 2026

/bot run

@venkywonka
Copy link
Copy Markdown
Collaborator

@coderabbitai what's the api contract described here?

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 16, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
cat tensorrt_llm/_torch/models/modeling_multimodal_utils.py | grep -n "encoder_forward_fn\|chunked_prefill\|cach\|api contract\|contract\|List\[torch\|return" | head -60

echo "---"
cat tensorrt_llm/_torch/models/modeling_nemotron_nano.py | grep -n "_encode_multimodal\|chunked_prefill\|num_tokens\|List\[torch\|return\|multimodal_data\|multimodal_embedding" | head -80

echo "---"
cat tests/unittest/_torch/modeling/test_modeling_nemotron_nano_v2_vl.py | grep -n "_assert_compatible\|chunked_prefill\|contract\|List\|length 1\|single tensor\|cach" | head -60

Length of output: 10999


🏁 Script executed:

#!/bin/bash
# Get the _assert_compatible_with_chunked_prefill function and surrounding context
sed -n '220,300p' tests/unittest/_torch/modeling/test_modeling_nemotron_nano_v2_vl.py

echo "=== get_multimodal_embeddings docstring ==="
sed -n '110,200p' tensorrt_llm/_torch/models/modeling_multimodal_utils.py

echo "=== _encode_multimodal in nemotron_nano ==="
sed -n '680,760p' tensorrt_llm/_torch/models/modeling_nemotron_nano.py

Length of output: 11619


🏁 Script executed:

#!/bin/bash
# Get the _encode_multimodal method in the NemotronH_Nano_VL_V2 class
grep -n "_encode_multimodal\|def forward\|def _encode_audio\|def _encode_image\|def _encode_video" tensorrt_llm/_torch/models/modeling_nemotron_nano.py | head -30

echo "==="
# Find and print _encode_multimodal method
ast-grep --pattern 'def _encode_multimodal($$$) {
  $$$
}'

echo "=== find _encode_multimodal in nemotron_nano.py ==="
grep -n "_encode_multimodal" tensorrt_llm/_torch/models/modeling_nemotron_nano.py

echo "=== read _encode_multimodal context ==="
# Find line number and print surrounding lines
LINE=$(grep -n "def _encode_multimodal" tensorrt_llm/_torch/models/modeling_nemotron_nano.py | head -1 | cut -d: -f1)
sed -n "${LINE},$((LINE+60))p" tensorrt_llm/_torch/models/modeling_nemotron_nano.py

Length of output: 3401


@venkywonka, here's a breakdown of the chunked prefill API contract as described in this PR:


The Contract: encoder_forward_fn must return List[torch.Tensor] of length exactly 1

The central utility get_multimodal_embeddings (in modeling_multimodal_utils.py) wraps the vision/audio encoder to enable embedding caching across chunks. It expects the encoder_forward_fn it receives to satisfy this contract:

Requirement Detail
Return type List[torch.Tensor]
Length Always 1 (all per-request embeddings concatenated into a single tensor), or [] when there is nothing to encode
Content result[0] is a 2-D torch.Tensor of shape [total_tokens, hidden_dim] across all requests in the batch
Auxiliary outputs Must not be returned as extra list elements; must be stored as side-channel data in each param.multimodal_data dict instead

The test helper _assert_compatible_with_chunked_prefill captures this precisely:

assert len(multimodal_embeddings) == 1
assert isinstance(multimodal_embeddings[0], torch.Tensor)

Why this matters for caching

get_multimodal_embeddings has an explicit guard:

if len(encoder_embeddings) > 1:
    logger.warning("Multiple modalities caching is not supported yet. ...")
    return encoder_embeddings   # ← caching is SKIPPED entirely

So if encoder_forward_fn returns anything other than a single-element list, the embedding is never cached, and the encoder re-runs on every chunk during chunked prefill — defeating the whole purpose.


What was broken before this PR

Before the fix, NemotronH_Nano_VL_V2._encode_multimodal was returning a tuple (mm_embeddings, mm_num_tokens):

  • mm_embeddings was a list of per-request tensors (length > 1 for batches)
  • mm_num_tokens was a separate auxiliary output for EVS video pruning

This violated both requirements: the length was > 1 and the elements weren't plain tensors. As a result, get_multimodal_embeddings fell into the warning/skip path and never cached any embeddings.


The fix

_encode_multimodal now:

  1. Concatenates all per-request embeddings into one tensor and returns [torch.cat(mm_embeddings, dim=0)]
  2. Stashes num_tokens_in_video (needed by EVS) as a side-channel into param.multimodal_data["num_tokens_in_video"] instead of returning it

The forward method then reads those stashed counts back when EVS pruning is needed:

# Retrieve per-video count stashed by `_encode_multimodal`.

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43673 [ run ] triggered by Bot. Commit: 1caf386 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43673 [ run ] completed with state FAILURE. Commit: 1caf386
/LLM/main/L0_MergeRequest_PR pipeline #34159 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@2ez4bz
Copy link
Copy Markdown
Collaborator Author

2ez4bz commented Apr 16, 2026

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43680 [ run ] triggered by Bot. Commit: 1caf386 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43680 [ run ] completed with state FAILURE. Commit: 1caf386
/LLM/main/L0_MergeRequest_PR pipeline #34164 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@yechank-nvidia
Copy link
Copy Markdown
Collaborator

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43728 [ run ] triggered by Bot. Commit: 1caf386 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43728 [ run ] completed with state FAILURE. Commit: 1caf386
/LLM/main/L0_MergeRequest_PR pipeline #34210 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@2ez4bz
Copy link
Copy Markdown
Collaborator Author

2ez4bz commented Apr 16, 2026

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43791 [ run ] triggered by Bot. Commit: 1caf386 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43791 [ run ] completed with state FAILURE. Commit: 1caf386

Link to invocation

@2ez4bz
Copy link
Copy Markdown
Collaborator Author

2ez4bz commented Apr 16, 2026

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43810 [ run ] triggered by Bot. Commit: 1caf386 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43810 [ run ] completed with state FAILURE. Commit: 1caf386
/LLM/main/L0_MergeRequest_PR pipeline #34286 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@2ez4bz
Copy link
Copy Markdown
Collaborator Author

2ez4bz commented Apr 17, 2026

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43909 [ run ] triggered by Bot. Commit: 1caf386 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43909 [ run ] completed with state FAILURE. Commit: 1caf386
/LLM/main/L0_MergeRequest_PR pipeline #34360 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@2ez4bz
Copy link
Copy Markdown
Collaborator Author

2ez4bz commented Apr 17, 2026

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43928 [ run ] triggered by Bot. Commit: 1caf386 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43928 [ run ] completed with state FAILURE. Commit: 1caf386
/LLM/main/L0_MergeRequest_PR pipeline #34376 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@2ez4bz
Copy link
Copy Markdown
Collaborator Author

2ez4bz commented Apr 17, 2026

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43945 [ run ] triggered by Bot. Commit: 1caf386 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43945 [ run ] completed with state SUCCESS. Commit: 1caf386
/LLM/main/L0_MergeRequest_PR pipeline #34391 completed with status: 'SUCCESS'

CI Report

Link to invocation

@2ez4bz 2ez4bz merged commit 91c1c14 into NVIDIA:main Apr 17, 2026
5 checks passed
@2ez4bz
Copy link
Copy Markdown
Collaborator Author

2ez4bz commented Apr 17, 2026

first try

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants