[None][fix] Fix chunked prefill API contract for nemotron nano VL by 2ez4bz · Pull Request #13025 · NVIDIA/TensorRT-LLM

2ez4bz · 2026-04-14T06:02:18Z

Summary by CodeRabbit

Improvements
- Enhanced diagnostic logging for multimodal embedding handling with more detailed information about returned tensors, embedding types, and token counts.
Tests
- Added comprehensive tests validating multimodal embedding caching behavior, encoder contract compliance, and chunked prefill compatibility.

Description

Why?

In order to opt into the caching functionality for chunked prefix, there are certain assumptions on the return type of the encoder's forward function. These assumptions did not hold for nemotron nano VL prior to this commit.

What?

This commit fixes this issue, and adds tests to catch regressions.

Test Coverage

See above.

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

coderabbitai · 2026-04-14T06:11:19Z

Caution

Review failed

An error occurred during the review process. Please try again later.

📝 Walkthrough

Walkthrough

These changes refactor multimodal embedding handling in Nemotron Nano models. The _encode_multimodal method now returns only a list of concatenated tensors instead of a tuple with token counts; token counts are stored directly in multimodal parameters. The forward method reconstructs token counts from context parameters when needed. Enhanced logging provides more detailed warning information for debugging.

Changes

Cohort / File(s)	Summary
Logging Enhancement `tensorrt_llm/_torch/models/modeling_multimodal_utils.py`	Enhanced `logger.warning` message to include number of returned embeddings, embedding type names, shapes, count of uncached multimodal params, and encoder function reference when `encoder_forward_fn` returns multiple tensors.
Core API Changes `tensorrt_llm/_torch/models/modeling_nemotron_nano.py`	Modified `_encode_multimodal` to return `List[torch.Tensor]` instead of tuple with token counts. Stores per-request `num_tokens_in_video` in `MultimodalParams.multimodal_data`. Updated `forward` to reconstruct `num_tokens_in_videos` from context params. Changed `merge_evs_mm_embeds` invocation to use sliced context params.
Test Updates `tests/unittest/_torch/modeling/test_modeling_nemotron_nano_v2_vl.py`	Added imports and comprehensive tests for multimodal caching contract. Updated existing dispatch tests for new return signature. Introduced `TestEncodeMultimodalContract` validating list-of-single-tensor return, and `TestChunkedPrefillCaching` verifying encoder caching behavior.

Sequence Diagram(s)

sequenceDiagram
    participant Forward as Forward()
    participant Encoder as _encode_multimodal()
    participant Cache as MultimodalParams<br/>(Cache Storage)
    participant Merge as merge_evs_mm_embeds()
    participant Context as Context Params

    Forward->>Encoder: Call with uncached params
    Encoder->>Encoder: Vision/Audio encoding
    Encoder->>Cache: Store embeddings in<br/>multimodal_data
    Encoder->>Cache: Store num_tokens_in_video
    Encoder-->>Forward: Return List[Tensor]
    Forward->>Merge: Pass merged embeddings<br/>+ context params
    Merge-->>Forward: Merged result
    Forward->>Context: Reconstruct num_tokens<br/>from cached values
    Forward-->>Forward: Continue with<br/>merged embeddings

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main change: fixing the chunked prefill API contract for nemotron nano VL, which aligns with the primary objective and code modifications across all three files.
Description check	✅ Passed	The description adequately explains the issue (assumptions about encoder return type not holding for nemotron nano VL), the solution (fixing the API contract and adding tests), and provides test coverage details.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

2ez4bz · 2026-04-14T16:53:47Z

/bot run

tensorrt-cicd · 2026-04-14T17:02:41Z

PR_Github #43265 [ run ] triggered by Bot. Commit: e7868dd Link to invocation

tensorrt-cicd · 2026-04-14T18:53:49Z

PR_Github #43265 [ run ] completed with state SUCCESS. Commit: e7868dd
/LLM/main/L0_MergeRequest_PR pipeline #33818 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

2ez4bz · 2026-04-15T03:53:43Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-15T03:59:17Z

PR_Github #43372 [ run ] triggered by Bot. Commit: e7868dd Link to invocation

tensorrt-cicd · 2026-04-15T07:56:13Z

PR_Github #43372 [ run ] completed with state SUCCESS. Commit: e7868dd
/LLM/main/L0_MergeRequest_PR pipeline #33907 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Wanli-Jiang · 2026-04-15T07:59:21Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-15T08:07:00Z

PR_Github #43443 [ run ] triggered by Bot. Commit: e7868dd Link to invocation

tensorrt-cicd · 2026-04-15T11:48:32Z

PR_Github #43443 [ run ] completed with state SUCCESS. Commit: e7868dd
/LLM/main/L0_MergeRequest_PR pipeline #33971 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Wanli-Jiang · 2026-04-15T12:35:55Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-15T12:52:09Z

PR_Github #43486 [ run ] triggered by Bot. Commit: e7868dd Link to invocation

tensorrt-cicd · 2026-04-15T13:19:52Z

PR_Github #43486 [ run ] completed with state FAILURE. Commit: e7868dd
/LLM/main/L0_MergeRequest_PR pipeline #34006 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

2ez4bz · 2026-04-15T16:37:22Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-15T16:44:11Z

PR_Github #43540 [ run ] triggered by Bot. Commit: e7868dd Link to invocation

tensorrt-cicd · 2026-04-15T17:54:44Z

PR_Github #43540 [ run ] completed with state SUCCESS. Commit: e7868dd
/LLM/main/L0_MergeRequest_PR pipeline #34045 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

2ez4bz · 2026-04-15T18:32:20Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-15T18:38:33Z

PR_Github #43556 [ run ] triggered by Bot. Commit: e7868dd Link to invocation

tensorrt-cicd · 2026-04-15T19:41:15Z

PR_Github #43556 [ run ] completed with state SUCCESS. Commit: e7868dd
/LLM/main/L0_MergeRequest_PR pipeline #34056 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

* Why? In order to opt into the caching functionality for chunked prefix, there are certain assumptions on the return type of the encoder's forward function. These assumptions did not hold for nemotron nano VL prior to this commit. * What? This commit fixes this issue, and adds tests to catch regressions. Signed-off-by: William Zhang <[email protected]>

2ez4bz · 2026-04-15T19:49:41Z

/bot run --disable-fail-fast

2ez4bz · 2026-04-16T05:56:05Z

/bot run

2ez4bz · 2026-04-16T06:20:18Z

/bot run

venkywonka · 2026-04-16T06:21:58Z

@coderabbitai what's the api contract described here?

coderabbitai · 2026-04-16T06:23:12Z

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
cat tensorrt_llm/_torch/models/modeling_multimodal_utils.py | grep -n "encoder_forward_fn\|chunked_prefill\|cach\|api contract\|contract\|List\[torch\|return" | head -60

echo "---"
cat tensorrt_llm/_torch/models/modeling_nemotron_nano.py | grep -n "_encode_multimodal\|chunked_prefill\|num_tokens\|List\[torch\|return\|multimodal_data\|multimodal_embedding" | head -80

echo "---"
cat tests/unittest/_torch/modeling/test_modeling_nemotron_nano_v2_vl.py | grep -n "_assert_compatible\|chunked_prefill\|contract\|List\|length 1\|single tensor\|cach" | head -60

Length of output: 10999

🏁 Script executed:

#!/bin/bash
# Get the _assert_compatible_with_chunked_prefill function and surrounding context
sed -n '220,300p' tests/unittest/_torch/modeling/test_modeling_nemotron_nano_v2_vl.py

echo "=== get_multimodal_embeddings docstring ==="
sed -n '110,200p' tensorrt_llm/_torch/models/modeling_multimodal_utils.py

echo "=== _encode_multimodal in nemotron_nano ==="
sed -n '680,760p' tensorrt_llm/_torch/models/modeling_nemotron_nano.py

Length of output: 11619

🏁 Script executed:

#!/bin/bash
# Get the _encode_multimodal method in the NemotronH_Nano_VL_V2 class
grep -n "_encode_multimodal\|def forward\|def _encode_audio\|def _encode_image\|def _encode_video" tensorrt_llm/_torch/models/modeling_nemotron_nano.py | head -30

echo "==="
# Find and print _encode_multimodal method
ast-grep --pattern 'def _encode_multimodal($$$) {
  $$$
}'

echo "=== find _encode_multimodal in nemotron_nano.py ==="
grep -n "_encode_multimodal" tensorrt_llm/_torch/models/modeling_nemotron_nano.py

echo "=== read _encode_multimodal context ==="
# Find line number and print surrounding lines
LINE=$(grep -n "def _encode_multimodal" tensorrt_llm/_torch/models/modeling_nemotron_nano.py | head -1 | cut -d: -f1)
sed -n "${LINE},$((LINE+60))p" tensorrt_llm/_torch/models/modeling_nemotron_nano.py

Length of output: 3401

@venkywonka, here's a breakdown of the chunked prefill API contract as described in this PR:

The Contract: `encoder_forward_fn` must return `List[torch.Tensor]` of length exactly 1

The central utility get_multimodal_embeddings (in modeling_multimodal_utils.py) wraps the vision/audio encoder to enable embedding caching across chunks. It expects the encoder_forward_fn it receives to satisfy this contract:

Requirement	Detail
Return type	`List[torch.Tensor]`
Length	Always `1` (all per-request embeddings concatenated into a single tensor), or `[]` when there is nothing to encode
Content	`result[0]` is a 2-D `torch.Tensor` of shape `[total_tokens, hidden_dim]` across all requests in the batch
Auxiliary outputs	Must not be returned as extra list elements; must be stored as side-channel data in each `param.multimodal_data` dict instead

The test helper _assert_compatible_with_chunked_prefill captures this precisely:

assert len(multimodal_embeddings) == 1
assert isinstance(multimodal_embeddings[0], torch.Tensor)

Why this matters for caching

get_multimodal_embeddings has an explicit guard:

if len(encoder_embeddings) > 1:
    logger.warning("Multiple modalities caching is not supported yet. ...")
    return encoder_embeddings   # ← caching is SKIPPED entirely

So if encoder_forward_fn returns anything other than a single-element list, the embedding is never cached, and the encoder re-runs on every chunk during chunked prefill — defeating the whole purpose.

What was broken before this PR

Before the fix, NemotronH_Nano_VL_V2._encode_multimodal was returning a tuple (mm_embeddings, mm_num_tokens):

mm_embeddings was a list of per-request tensors (length > 1 for batches)
mm_num_tokens was a separate auxiliary output for EVS video pruning

This violated both requirements: the length was > 1 and the elements weren't plain tensors. As a result, get_multimodal_embeddings fell into the warning/skip path and never cached any embeddings.

The fix

_encode_multimodal now:

Concatenates all per-request embeddings into one tensor and returns [torch.cat(mm_embeddings, dim=0)]
Stashes num_tokens_in_video (needed by EVS) as a side-channel into param.multimodal_data["num_tokens_in_video"] instead of returning it

The forward method then reads those stashed counts back when EVS pruning is needed:

# Retrieve per-video count stashed by `_encode_multimodal`.

tensorrt-cicd · 2026-04-16T06:26:11Z

PR_Github #43673 [ run ] triggered by Bot. Commit: 1caf386 Link to invocation

tensorrt-cicd · 2026-04-16T06:41:19Z

PR_Github #43673 [ run ] completed with state FAILURE. Commit: 1caf386
/LLM/main/L0_MergeRequest_PR pipeline #34159 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

2ez4bz · 2026-04-16T06:52:51Z

/bot run

tensorrt-cicd · 2026-04-16T06:59:45Z

PR_Github #43680 [ run ] triggered by Bot. Commit: 1caf386 Link to invocation

tensorrt-cicd · 2026-04-16T07:11:19Z

PR_Github #43680 [ run ] completed with state FAILURE. Commit: 1caf386
/LLM/main/L0_MergeRequest_PR pipeline #34164 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

yechank-nvidia · 2026-04-16T08:22:29Z

/bot run

tensorrt-cicd · 2026-04-16T08:28:27Z

PR_Github #43728 [ run ] triggered by Bot. Commit: 1caf386 Link to invocation

tensorrt-cicd · 2026-04-16T10:04:47Z

PR_Github #43728 [ run ] completed with state FAILURE. Commit: 1caf386
/LLM/main/L0_MergeRequest_PR pipeline #34210 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

2ez4bz · 2026-04-16T15:06:59Z

/bot run

tensorrt-cicd · 2026-04-16T15:12:57Z

PR_Github #43791 [ run ] triggered by Bot. Commit: 1caf386 Link to invocation

tensorrt-cicd · 2026-04-16T15:14:27Z

PR_Github #43791 [ run ] completed with state FAILURE. Commit: 1caf386

Link to invocation

2ez4bz · 2026-04-16T16:10:17Z

/bot run

tensorrt-cicd · 2026-04-16T16:38:53Z

PR_Github #43810 [ run ] triggered by Bot. Commit: 1caf386 Link to invocation

tensorrt-cicd · 2026-04-17T01:13:15Z

PR_Github #43810 [ run ] completed with state FAILURE. Commit: 1caf386
/LLM/main/L0_MergeRequest_PR pipeline #34286 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

2ez4bz · 2026-04-17T02:54:04Z

/bot run

tensorrt-cicd · 2026-04-17T03:02:06Z

PR_Github #43909 [ run ] triggered by Bot. Commit: 1caf386 Link to invocation

tensorrt-cicd · 2026-04-17T03:15:07Z

PR_Github #43909 [ run ] completed with state FAILURE. Commit: 1caf386
/LLM/main/L0_MergeRequest_PR pipeline #34360 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

2ez4bz · 2026-04-17T03:27:11Z

/bot run

tensorrt-cicd · 2026-04-17T03:34:31Z

PR_Github #43928 [ run ] triggered by Bot. Commit: 1caf386 Link to invocation

tensorrt-cicd · 2026-04-17T03:52:02Z

PR_Github #43928 [ run ] completed with state FAILURE. Commit: 1caf386
/LLM/main/L0_MergeRequest_PR pipeline #34376 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

2ez4bz · 2026-04-17T03:53:04Z

/bot run

tensorrt-cicd · 2026-04-17T04:02:46Z

PR_Github #43945 [ run ] triggered by Bot. Commit: 1caf386 Link to invocation

tensorrt-cicd · 2026-04-17T08:47:28Z

PR_Github #43945 [ run ] completed with state SUCCESS. Commit: 1caf386
/LLM/main/L0_MergeRequest_PR pipeline #34391 completed with status: 'SUCCESS'

CI Report

Link to invocation

2ez4bz · 2026-04-17T14:49:40Z

2ez4bz requested a review from a team as a code owner April 14, 2026 06:02

2ez4bz requested a review from symphonylyh April 14, 2026 06:02

github-actions Bot assigned 2ez4bz Apr 14, 2026

2ez4bz force-pushed the dev-nano-v3-fix-chunked-prefill-api branch 2 times, most recently from f4313d5 to 35608fe Compare April 14, 2026 06:09

yechank-nvidia approved these changes Apr 14, 2026

View reviewed changes

2ez4bz force-pushed the dev-nano-v3-fix-chunked-prefill-api branch from 35608fe to e7868dd Compare April 14, 2026 16:52

2ez4bz enabled auto-merge (squash) April 14, 2026 17:54

2ez4bz force-pushed the dev-nano-v3-fix-chunked-prefill-api branch from e7868dd to 1caf386 Compare April 15, 2026 19:49

2ez4bz mentioned this pull request Apr 16, 2026

[None][feat] DO NOT MERGE: Stacked changes for nano nemotron omni #13138

Closed

1 task

2ez4bz merged commit 91c1c14 into NVIDIA:main Apr 17, 2026
5 checks passed

Conversation

2ez4bz commented Apr 14, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Description

Test Coverage

PR Checklist

GitHub Bot Help

Uh oh!

coderabbitai Bot commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

2ez4bz commented Apr 14, 2026

Uh oh!

tensorrt-cicd commented Apr 14, 2026

Uh oh!

tensorrt-cicd commented Apr 14, 2026

Uh oh!

2ez4bz commented Apr 15, 2026

Uh oh!

tensorrt-cicd commented Apr 15, 2026

Uh oh!

tensorrt-cicd commented Apr 15, 2026

Uh oh!

Wanli-Jiang commented Apr 15, 2026

Uh oh!

tensorrt-cicd commented Apr 15, 2026

Uh oh!

tensorrt-cicd commented Apr 15, 2026

Uh oh!

Wanli-Jiang commented Apr 15, 2026

Uh oh!

tensorrt-cicd commented Apr 15, 2026

Uh oh!

tensorrt-cicd commented Apr 15, 2026

Uh oh!

2ez4bz commented Apr 15, 2026

Uh oh!

tensorrt-cicd commented Apr 15, 2026

Uh oh!

tensorrt-cicd commented Apr 15, 2026

Uh oh!

2ez4bz commented Apr 15, 2026

Uh oh!

tensorrt-cicd commented Apr 15, 2026

Uh oh!

tensorrt-cicd commented Apr 15, 2026

Uh oh!

2ez4bz commented Apr 15, 2026

Uh oh!

2ez4bz commented Apr 16, 2026

Uh oh!

2ez4bz commented Apr 16, 2026

Uh oh!

venkywonka commented Apr 16, 2026

Uh oh!

coderabbitai Bot commented Apr 16, 2026

The Contract: encoder_forward_fn must return List[torch.Tensor] of length exactly 1

Why this matters for caching

What was broken before this PR

The fix

Uh oh!

tensorrt-cicd commented Apr 16, 2026

Uh oh!

tensorrt-cicd commented Apr 16, 2026

Uh oh!

2ez4bz commented Apr 16, 2026

Uh oh!

tensorrt-cicd commented Apr 16, 2026

Uh oh!

tensorrt-cicd commented Apr 16, 2026

Uh oh!

yechank-nvidia commented Apr 16, 2026

Uh oh!

2ez4bz commented Apr 14, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 14, 2026 •

edited

Loading

The Contract: `encoder_forward_fn` must return `List[torch.Tensor]` of length exactly 1