[https://nvbugs/5961414][fix] Pre-cache aesthetic predictor weights to avoid VBench 429 errors by chang-l · Pull Request #12127 · NVIDIA/TensorRT-LLM

chang-l · 2026-03-11T23:41:18Z

Summary

- Pre-download LAION aesthetic predictor weights (sa_0_4_vit_l_14_linear.pth) to ~/.cache/emb_reader/ with retries and exponential backoff before VBench evaluation, preventing GitHub HTTP 429 (Too Many Requests) failures in CI.
- Follows the same pre-caching pattern already used for DINO (_precache_dino_for_torch_hub).
- Unwaive the 3 VBench visual_gen tests that were blocked by this issue.

Test plan

- Run pytest tests/integration/defs/examples/test_visual_gen.py -v -k "test_vbench_dimension_score" on a B200 node with model weights available.
- Verify the aesthetic predictor file is cached at ~/.cache/emb_reader/sa_0_4_vit_l_14_linear.pth after the first run.
- Verify subsequent runs skip the download.
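
The cache-hit behaviour that the test plan checks could look roughly like the sketch below. It is an illustration only: `precache_file` and its `fetch` callback are hypothetical names, and the actual helper in test_visual_gen.py may be organised differently.

```python
import os
import tempfile
from typing import Callable


def precache_file(cache_dir: str, filename: str,
                  fetch: Callable[[str], None]) -> bool:
    """Ensure cache_dir/filename exists; return True if a download happened.

    `fetch` is a hypothetical callback that writes the file to the path it
    is given. The real code would download the weights from the network.
    """
    cached_path = os.path.join(cache_dir, filename)
    if os.path.isfile(cached_path):
        return False  # cache hit: skip the download entirely
    os.makedirs(cache_dir, exist_ok=True)
    fetch(cached_path)
    return True


def _fake_fetch(path: str) -> None:
    # Stand-in for the network download, used only for this demonstration.
    with open(path, "wb") as f:
        f.write(b"weights")


with tempfile.TemporaryDirectory() as tmp:
    first = precache_file(tmp, "sa_0_4_vit_l_14_linear.pth", _fake_fetch)
    second = precache_file(tmp, "sa_0_4_vit_l_14_linear.pth", _fake_fetch)
    print(first, second)  # first call downloads, second call is a cache hit
```

Passing the fetcher in as a parameter keeps the cache check testable without touching the network.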

🤖 Generated with Claude Code

Summary by CodeRabbit

Tests
- Improved test infrastructure reliability with automated resource prefetching to enhance test execution stability and reduce intermittent failures during test runs
- Removed 6 previously skipped tests from the active suite as their underlying issues have been fully resolved

…o avoid VBench 429 errors

VBench's aesthetic_quality dimension downloads sa_0_4_vit_l_14_linear.pth from GitHub via wget at evaluation time. In CI environments GitHub often returns HTTP 429 (Too Many Requests), causing the test to fail.

Pre-download the file to ~/.cache/emb_reader/ with retries and exponential backoff before VBench evaluation runs, following the same pattern used for the DINO torch.hub pre-cache. Also unwaive the 3 VBench visual_gen tests that were blocked by this issue.

Signed-off-by: Chang Liu <[email protected]>

coderabbitai · 2026-03-11T23:45:04Z

📝 Walkthrough

Walkthrough

Added prefetching logic for LAION aesthetic predictor weights with exponential backoff retry mechanism to avoid GitHub rate limits. Removed test skip waivers for visual generation and performance evaluation tests that are now expected to pass.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Aesthetic Predictor Prefetch `tests/integration/defs/examples/test_visual_gen.py` | Added three constants (`AESTHETIC_PREDICTOR_URL`, `AESTHETIC_PREDICTOR_FILENAME`, `AESTHETIC_PREDICTOR_CACHE_DIR`), a download helper function with exponential backoff retry logic, and prefetch invocation in repository setup paths to preload weights and avoid GitHub rate limits. |
| Test Waive Removal `tests/integration/test_lists/waives.txt` | Removed six SKIP entries for visual generation dimension score tests (`test_vbench_dimension_score_wan*`) and performance/accuracy tests related to disaggregated serving and DeepSeek models, indicating these tests are now expected to pass. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: pre-caching aesthetic predictor weights to fix VBench 429 errors, which aligns with the core objective of the PR. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Description check | ✅ Passed | The PR description covers key sections but lacks explicit test coverage details and PR checklist completion. |

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/integration/defs/examples/test_visual_gen.py`:
- Around line 178-192: Change the broad except around urllib.request.urlretrieve
to catch specific exceptions (urllib.error.URLError and OSError) and protect
against partial downloads by writing to a temporary path (e.g., cached_path +
".tmp") and only renaming/moving the temp file to cached_path on successful
completion; ensure the temp file is removed on failure before retrying and that
the retry/backoff logic using attempt and max_retries remains the same so
corrupted partial files are never left at cached_path.
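
One way to read the suggestion above is the sketch below. It is not the code that landed in the PR: `download_atomically`, `base_delay`, and the injectable `retrieve` and `sleep` parameters are assumptions made for illustration, while `cached_path`, `attempt`, and `max_retries` follow the names used in the review comment.

```python
import os
import time
import urllib.error
import urllib.request


def download_atomically(url: str, cached_path: str, max_retries: int = 3,
                        base_delay: float = 1.0,
                        retrieve=urllib.request.urlretrieve,
                        sleep=time.sleep) -> None:
    """Download url to cached_path without ever leaving a partial file there."""
    tmp_path = cached_path + ".tmp"
    for attempt in range(1, max_retries + 1):
        try:
            retrieve(url, tmp_path)
            # os.replace is atomic when both paths are on the same filesystem,
            # so readers see either no file or the complete file.
            os.replace(tmp_path, cached_path)
            return
        except (urllib.error.URLError, OSError):
            # Remove any partial download before retrying or giving up.
            if os.path.exists(tmp_path):
                os.remove(tmp_path)
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
```

`urllib.error.URLError` is itself a subclass of `OSError`, so listing both is redundant but makes the intent explicit.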

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ce95000a-8987-4c50-81b6-009f65106071

📥 Commits

Reviewing files that changed from the base of the PR and between bf7142f and c0b7f45.

📒 Files selected for processing (2)

tests/integration/defs/examples/test_visual_gen.py
tests/integration/test_lists/waives.txt

💤 Files with no reviewable changes (1)

tests/integration/test_lists/waives.txt

chang-l · 2026-03-12T03:54:14Z

/bot run --extra-stage "DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2"

tensorrt-cicd · 2026-03-12T04:00:40Z

PR_Github #38663 [ run ] triggered by Bot. Commit: c0b7f45 Link to invocation

tensorrt-cicd · 2026-03-12T06:16:26Z

PR_Github #38663 [ run ] completed with state SUCCESS. Commit: c0b7f45
/LLM/main/L0_MergeRequest_PR pipeline #29989 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: Chang Liu <[email protected]>

chang-l · 2026-03-13T19:13:05Z

/bot run --extra-stage "DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2"

tensorrt-cicd · 2026-03-13T19:19:38Z

PR_Github #38894 [ run ] triggered by Bot. Commit: 63e7514 Link to invocation

tensorrt-cicd · 2026-03-13T20:29:01Z

PR_Github #38894 [ run ] completed with state SUCCESS. Commit: 63e7514
/LLM/main/L0_MergeRequest_PR pipeline #30202 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

chang-l · 2026-03-13T20:31:40Z

/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2"

tensorrt-cicd · 2026-03-13T20:39:13Z

PR_Github #38899 [ run ] triggered by Bot. Commit: 63e7514 Link to invocation

tensorrt-cicd · 2026-03-14T02:43:23Z

PR_Github #38899 [ run ] completed with state FAILURE. Commit: 63e7514
/LLM/main/L0_MergeRequest_PR pipeline #30208 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Use raw.githubusercontent.com CDN URL instead of GitHub blob redirect, add User-Agent header, increase retries to 8 with longer exponential backoff (10-120s + jitter), and use atomic file writes to prevent corruption from partial downloads. Signed-off-by: Chang Liu <[email protected]> Signed-off-by: Chang Liu <[email protected]>
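
The retry schedule described in this commit message could be sketched as follows. The commit only states 8 retries and a 10-120s backoff plus jitter; the doubling formula, the 5-second jitter bound, the User-Agent string, and the names `backoff_delay` and `build_request` are assumptions for illustration.

```python
import random
import urllib.request


def backoff_delay(attempt: int, base: float = 10.0, cap: float = 120.0,
                  max_jitter: float = 5.0, rng=random) -> float:
    """Delay in seconds before retry `attempt` (1-based).

    Doubles from `base`, is capped at `cap`, then adds uniform jitter so
    that parallel CI jobs do not retry in lockstep.
    """
    exponential = min(cap, base * 2 ** (attempt - 1))
    return exponential + rng.uniform(0.0, max_jitter)


def build_request(url: str) -> urllib.request.Request:
    """Attach an explicit User-Agent header (hypothetical value)."""
    return urllib.request.Request(
        url, headers={"User-Agent": "vbench-precache-sketch/0.1"})


# Print the schedule for 8 attempts; values vary because of the jitter.
for attempt in range(1, 9):
    print(attempt, round(backoff_delay(attempt), 1))
```

With these assumed defaults the un-jittered delays are 10, 20, 40, 80 seconds and then 120 seconds for every later attempt.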

chang-l · 2026-03-14T03:55:33Z

/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2"

tensorrt-cicd · 2026-03-14T04:01:54Z

PR_Github #38931 [ run ] triggered by Bot. Commit: 3d6f10d Link to invocation

tensorrt-cicd · 2026-03-14T04:01:54Z

PR_Github #38931 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 9 PM PST on 3/14.

Link to invocation

chang-l · 2026-03-14T06:31:02Z

/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2"

tensorrt-cicd · 2026-03-14T06:37:22Z

PR_Github #38933 [ run ] triggered by Bot. Commit: 3d6f10d Link to invocation

tensorrt-cicd · 2026-03-14T06:37:22Z

PR_Github #38933 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 9 PM PST on 3/14.

Link to invocation

chang-l · 2026-03-14T08:00:45Z

/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2"

tensorrt-cicd · 2026-03-14T08:06:38Z

PR_Github #38936 [ run ] triggered by Bot. Commit: 3d6f10d Link to invocation

tensorrt-cicd · 2026-03-14T08:06:39Z

PR_Github #38936 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 9 PM PST on 3/14.

Link to invocation

chang-l · 2026-03-14T09:30:43Z

/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2"

tensorrt-cicd · 2026-03-14T09:36:38Z

PR_Github #38938 [ run ] triggered by Bot. Commit: 3d6f10d Link to invocation

tensorrt-cicd · 2026-03-14T09:36:38Z

PR_Github #38938 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 9 PM PST on 3/14.

Link to invocation

chang-l · 2026-03-14T11:00:45Z

/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2"

chang-l · 2026-03-17T23:03:25Z

/bot run --extra-stage "DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2"

tensorrt-cicd · 2026-03-17T23:10:24Z

PR_Github #39328 [ run ] triggered by Bot. Commit: 10019f8 Link to invocation

tensorrt-cicd · 2026-03-18T04:27:23Z

PR_Github #39328 [ run ] completed with state SUCCESS. Commit: 10019f8
/LLM/main/L0_MergeRequest_PR pipeline #30575 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

chang-l · 2026-03-18T04:34:47Z

/bot run --extra-stage "DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2"

chang-l · 2026-03-18T05:03:48Z

/bot run --extra-stage "DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2"

tensorrt-cicd · 2026-03-18T05:10:07Z

PR_Github #39383 [ run ] triggered by Bot. Commit: 10019f8 Link to invocation

tensorrt-cicd · 2026-03-18T06:07:59Z

PR_Github #39383 [ run ] completed with state SUCCESS. Commit: 10019f8
/LLM/main/L0_MergeRequest_PR pipeline #30624 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

MediaStorage.save_video() requires the ffmpeg CLI to encode MP4 output. Install it via apt-get in the _visual_gen_deps test fixture so the VBench dimension score tests can complete successfully in CI. Signed-off-by: Chang Liu <[email protected]>

chang-l · 2026-03-18T20:33:18Z

/bot run --extra-stage "DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2"

tensorrt-cicd · 2026-03-18T20:39:18Z

PR_Github #39503 [ run ] triggered by Bot. Commit: 25401e5 Link to invocation

tensorrt-cicd · 2026-03-19T02:17:35Z

PR_Github #39503 [ run ] completed with state SUCCESS. Commit: 25401e5
/LLM/main/L0_MergeRequest_PR pipeline #30724 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

apt-get install without a prior update returns exit code 100 (package not found) in CI containers with stale package lists. Signed-off-by: Chang Liu <[email protected]>
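
A minimal sketch of the ordering this commit message describes, assuming a Debian/Ubuntu-based CI container with root privileges. `ensure_ffmpeg` and its injectable `run` and `which` parameters are hypothetical; the real _visual_gen_deps fixture may be written differently.

```python
import shutil
import subprocess


def ensure_ffmpeg(run=subprocess.run, which=shutil.which) -> bool:
    """Install the ffmpeg CLI via apt-get if missing; return True if it ran."""
    if which("ffmpeg") is not None:
        return False  # already on PATH, nothing to do
    # Refresh the package lists first. Per the commit message, installing
    # against stale lists fails with exit code 100.
    run(["apt-get", "update"], check=True)
    run(["apt-get", "install", "-y", "ffmpeg"], check=True)
    return True
```

Using `check=True` makes a failed `apt-get` step raise `subprocess.CalledProcessError` immediately, so the failure surfaces in fixture setup rather than later during video encoding.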

chang-l · 2026-03-19T03:50:53Z

/bot run --extra-stage "DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2"

chang-l · 2026-03-19T04:03:02Z

/bot run --extra-stage "DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2"

…ench Resolve waives.txt conflict: keep both TestNemotronNanoV3 waive and the VBench test waives from main (added as temporary workaround while the 429 fix was in progress). Signed-off-by: Chang Liu <[email protected]>

chang-l · 2026-03-19T04:34:58Z

/bot run --extra-stage "DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2"

tensorrt-cicd · 2026-03-19T04:44:29Z

PR_Github #39539 [ run ] triggered by Bot. Commit: e3dd019 Link to invocation

tensorrt-cicd · 2026-03-19T10:48:55Z

PR_Github #39539 [ run ] completed with state SUCCESS. Commit: e3dd019
/LLM/main/L0_MergeRequest_PR pipeline #30758 completed with status: 'SUCCESS'

CI Report

Link to invocation

Remove the 5 test_vbench_dimension_score_* waives (NVBug 5961414) since this PR fixes the underlying 429 rate-limit and ffmpeg issues that caused them to fail. Signed-off-by: Chang Liu <[email protected]>

chang-l · 2026-03-20T00:43:21Z

/bot run --extra-stage "DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2"

This waive was accidentally added during merge conflict resolution and does not belong in this PR. Signed-off-by: Chang Liu <[email protected]>

chang-l · 2026-03-20T00:49:52Z

/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2"

tensorrt-cicd · 2026-03-20T00:49:56Z

PR_Github #39654 [ run ] triggered by Bot. Commit: 262dc4b Link to invocation

tensorrt-cicd · 2026-03-20T00:56:18Z

PR_Github #39658 [ run ] triggered by Bot. Commit: 262dc4b Link to invocation

tensorrt-cicd · 2026-03-20T07:09:51Z

PR_Github #39658 [ run ] completed with state SUCCESS. Commit: 262dc4b
/LLM/main/L0_MergeRequest_PR pipeline #30863 completed with status: 'SUCCESS'

CI Report

Link to invocation

chang-l · 2026-03-20T18:51:42Z

Also, @yibinl-nvidia, it seems that in your PR #12009 the subpath ltx-video-2-0.9.7 (TensorRT-LLM/tests/integration/defs/examples/test_visual_gen.py, line 38 in 4197e19):

LTX2_MODEL_SUBPATH = "ltx-video-2-0.9.7"

doesn't match the actual CI model directory name LTX-2. Could you fix this in a follow-up PR? Right now the LTX2 tests are being skipped because the checkpoint can't be found.

yibinl-nvidia · 2026-03-23T22:20:04Z

Thanks Chang for flagging this, should be fixed by #12463

…o avoid VBench 429 errors (NVIDIA#12127) Signed-off-by: Chang Liu <[email protected]> Signed-off-by: Chang Liu <[email protected]> Signed-off-by: Shreyas Misra <[email protected]> Co-authored-by: Shreyas Misra <[email protected]>

chang-l requested a review from a team as a code owner March 11, 2026 23:41

github-actions Bot assigned chang-l Mar 11, 2026

coderabbitai Bot reviewed Mar 11, 2026

Comment thread tests/integration/defs/examples/test_visual_gen.py

chang-l requested a review from yibinl-nvidia March 12, 2026 00:28

Merge branch 'main' into fix/precache-aesthetic-predictor-vbench

63e7514

Signed-off-by: Chang Liu <[email protected]>

[None][fix] Run apt-get update before installing ffmpeg

a35986b

apt-get install without a prior update returns exit code 100 (package not found) in CI containers with stale package lists. Signed-off-by: Chang Liu <[email protected]>

zhenhuaw-me approved these changes Mar 20, 2026

Comment thread tests/integration/test_lists/waives.txt Outdated

Comment thread tests/integration/defs/examples/test_visual_gen.py

Comment thread tensorrt_llm/_torch/visual_gen/modules/vae/parallel_vae_interface.py

[None][fix] Remove VBench dimension score test waives

e095108

Remove the 5 test_vbench_dimension_score_* waives (NVBug 5961414) since this PR fixes the underlying 429 rate-limit and ffmpeg issues that caused them to fail. Signed-off-by: Chang Liu <[email protected]>

[None][fix] Remove unrelated TestNemotronNanoV3 waive

262dc4b

This waive was accidentally added during merge conflict resolution and does not belong in this PR. Signed-off-by: Chang Liu <[email protected]>

chang-l merged commit 98c7683 into NVIDIA:main Mar 20, 2026
5 checks passed
