[https://nvbugs/5961414][fix] Pre-cache aesthetic predictor weights to avoid VBench 429 errors #12127

Merged: chang-l merged 10 commits into NVIDIA:main from chang-l:fix/precache-aesthetic-predictor-vbench on Mar 20, 2026

@chang-l
Collaborator

@chang-l chang-l commented Mar 11, 2026

Summary

  • Pre-download LAION aesthetic predictor weights (sa_0_4_vit_l_14_linear.pth) to ~/.cache/emb_reader/ with retries and exponential backoff before VBench evaluation, preventing GitHub HTTP 429 (Too Many Requests) failures in CI
  • Follows the same pre-caching pattern already used for DINO (_precache_dino_for_torch_hub)
  • Unwaive the 3 VBench visual_gen tests that were blocked by this issue
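
A minimal sketch of the pre-caching idea described in the summary above, not the code from this PR. The function name `precache_file`, its parameters, and the delay values are assumptions made for illustration; the `fetch` and `sleep` arguments exist only so the retry loop can be exercised without network access.

```python
import os
import time
import urllib.request


def precache_file(url, cache_dir, filename, max_retries=5, base_delay=1.0,
                  fetch=urllib.request.urlretrieve, sleep=time.sleep):
    """Download `url` to cache_dir/filename unless it is already cached.

    Hypothetical helper: retries on network errors with exponential
    backoff and returns the path of the cached file.
    """
    cached_path = os.path.join(cache_dir, filename)
    if os.path.isfile(cached_path):
        return cached_path  # Cache hit: skip the download entirely.

    os.makedirs(cache_dir, exist_ok=True)
    for attempt in range(1, max_retries + 1):
        try:
            fetch(url, cached_path)
            return cached_path
        except OSError:  # urllib.error.URLError and HTTPError subclass OSError.
            if attempt == max_retries:
                raise  # Out of retries: surface the last error.
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_retries must be at least 1")
```

With this shape, a call such as `precache_file(url, os.path.expanduser("~/.cache/emb_reader"), "sa_0_4_vit_l_14_linear.pth")` returns immediately on later runs because the file already exists.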

Test plan

  • Run pytest tests/integration/defs/examples/test_visual_gen.py -v -k "test_vbench_dimension_score" on a B200 node with model weights available
  • Verify the aesthetic predictor file is cached at ~/.cache/emb_reader/sa_0_4_vit_l_14_linear.pth after first run
  • Verify subsequent runs skip the download

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Tests
    • Improved test infrastructure reliability with automated resource prefetching to enhance test execution stability and reduce intermittent failures during test runs
    • Re-enabled 6 previously skipped tests by removing their skip entries, as their underlying issues have been resolved

…o avoid VBench 429 errors

VBench's aesthetic_quality dimension downloads sa_0_4_vit_l_14_linear.pth
from GitHub via wget at evaluation time. In CI environments GitHub often
returns HTTP 429 (Too Many Requests), causing the test to fail.

Pre-download the file to ~/.cache/emb_reader/ with retries and exponential
backoff before VBench evaluation runs, following the same pattern used for
the DINO torch.hub pre-cache.

Also unwaive the 3 VBench visual_gen tests that were blocked by this issue.

Signed-off-by: Chang Liu <[email protected]>
@chang-l chang-l requested a review from a team as a code owner March 11, 2026 23:41
@coderabbitai
Contributor

coderabbitai Bot commented Mar 11, 2026

Walkthrough

Added prefetching logic for LAION aesthetic predictor weights with exponential backoff retry mechanism to avoid GitHub rate limits. Removed test skip waivers for visual generation and performance evaluation tests that are now expected to pass.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Aesthetic Predictor Prefetch: tests/integration/defs/examples/test_visual_gen.py | Added three constants (AESTHETIC_PREDICTOR_URL, AESTHETIC_PREDICTOR_FILENAME, AESTHETIC_PREDICTOR_CACHE_DIR), a download helper function with exponential backoff retry logic, and prefetch invocation in repository setup paths to preload weights and avoid GitHub rate limits. |
| Test Waive Removal: tests/integration/test_lists/waives.txt | Removed six SKIP entries for visual generation dimension score tests (test_vbench_dimension_score_wan*) and performance/accuracy tests related to disaggregated serving and DeepSeek models, indicating these tests are now expected to pass. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: pre-caching aesthetic predictor weights to fix VBench 429 errors, which aligns with the core objective of the PR. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Description check | ✅ Passed | The PR description covers key sections but lacks explicit test coverage details and PR checklist completion. |

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/integration/defs/examples/test_visual_gen.py`:
- Around line 178-192: Change the broad except around urllib.request.urlretrieve
to catch specific exceptions (urllib.error.URLError and OSError) and protect
against partial downloads by writing to a temporary path (e.g., cached_path +
".tmp") and only renaming/moving the temp file to cached_path on successful
completion; ensure the temp file is removed on failure before retrying and that
the retry/backoff logic using attempt and max_retries remains the same so
corrupted partial files are never left at cached_path.
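
The suggestion above can be sketched roughly as follows. This is an illustration of the temp-file-then-rename pattern under stated assumptions, not the change that landed in the PR: `download_atomically` and its `fetch` parameter are hypothetical names, and the retry loop is left out to keep the example short.

```python
import os
import urllib.error
import urllib.request


def download_atomically(url, cached_path, fetch=urllib.request.urlretrieve):
    """Download to cached_path + ".tmp" and rename only on success.

    Hypothetical helper: a failed or interrupted download never leaves
    a partial file at cached_path.
    """
    tmp_path = cached_path + ".tmp"
    try:
        fetch(url, tmp_path)
        # os.replace is atomic when both paths are on the same filesystem.
        os.replace(tmp_path, cached_path)
    except (urllib.error.URLError, OSError):
        # Remove the partial temp file so a retry starts from a clean state.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A caller would wrap this function in its existing retry/backoff loop, so each attempt either produces a complete file or leaves nothing behind.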

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ce95000a-8987-4c50-81b6-009f65106071

📥 Commits

Reviewing files that changed from the base of the PR and between bf7142f and c0b7f45.

📒 Files selected for processing (2)
  • tests/integration/defs/examples/test_visual_gen.py
  • tests/integration/test_lists/waives.txt
💤 Files with no reviewable changes (1)
  • tests/integration/test_lists/waives.txt

Comment thread tests/integration/defs/examples/test_visual_gen.py
@chang-l chang-l requested a review from yibinl-nvidia March 12, 2026 00:28
@chang-l
Collaborator Author

chang-l commented Mar 12, 2026

/bot run --extra-stage "DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2"

@tensorrt-cicd
Collaborator

PR_Github #38663 [ run ] triggered by Bot. Commit: c0b7f45 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38663 [ run ] completed with state SUCCESS. Commit: c0b7f45
/LLM/main/L0_MergeRequest_PR pipeline #29989 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@chang-l
Collaborator Author

chang-l commented Mar 13, 2026

/bot run --extra-stage "DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2"

@tensorrt-cicd
Collaborator

PR_Github #38894 [ run ] triggered by Bot. Commit: 63e7514 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38894 [ run ] completed with state SUCCESS. Commit: 63e7514
/LLM/main/L0_MergeRequest_PR pipeline #30202 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@chang-l
Collaborator Author

chang-l commented Mar 13, 2026

/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2"

@tensorrt-cicd
Collaborator

PR_Github #38899 [ run ] triggered by Bot. Commit: 63e7514 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38899 [ run ] completed with state FAILURE. Commit: 63e7514
/LLM/main/L0_MergeRequest_PR pipeline #30208 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Use raw.githubusercontent.com CDN URL instead of GitHub blob redirect,
add User-Agent header, increase retries to 8 with longer exponential
backoff (10-120s + jitter), and use atomic file writes to prevent
corruption from partial downloads.

Signed-off-by: Chang Liu <[email protected]>
Signed-off-by: Chang Liu <[email protected]>
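
To make the backoff and header changes in the commit message above concrete, here is a small sketch. The 10 s base, the 120 s cap, and the retry count of 8 come from the commit message; the doubling schedule, the 5 s jitter range, the function names, and the User-Agent string are assumptions, since the actual code is not shown in this conversation.

```python
import random
import urllib.request


def backoff_delay(attempt, base=10.0, cap=120.0, jitter=5.0, rng=random.uniform):
    """Seconds to wait before retrying after 0-based attempt `attempt`.

    Doubles the delay each attempt, caps it, then adds random jitter so
    parallel CI jobs do not all retry at the same instant.
    """
    return min(cap, base * 2 ** attempt) + rng(0.0, jitter)


def build_request(url, user_agent="example-precache/1.0"):
    """Build a request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    # Print the schedule for 8 attempts with jitter disabled.
    print([backoff_delay(a, rng=lambda lo, hi: 0.0) for a in range(8)])
```

The request object returned by `build_request` can be passed to `urllib.request.urlopen` in place of a bare URL string.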
@chang-l
Collaborator Author

chang-l commented Mar 14, 2026

/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2"

@tensorrt-cicd
Collaborator

PR_Github #38931 [ run ] triggered by Bot. Commit: 3d6f10d Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38931 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 9 PM PST on 3/14.

Link to invocation

@chang-l
Collaborator Author

chang-l commented Mar 14, 2026

/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2"

@tensorrt-cicd
Collaborator

PR_Github #38933 [ run ] triggered by Bot. Commit: 3d6f10d Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38933 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 9 PM PST on 3/14.

Link to invocation

@chang-l
Collaborator Author

chang-l commented Mar 14, 2026

/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2"

@tensorrt-cicd
Collaborator

PR_Github #38936 [ run ] triggered by Bot. Commit: 3d6f10d Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38936 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 9 PM PST on 3/14.

Link to invocation

@chang-l
Collaborator Author

chang-l commented Mar 14, 2026

/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2"

@tensorrt-cicd
Collaborator

PR_Github #38938 [ run ] triggered by Bot. Commit: 3d6f10d Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38938 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 9 PM PST on 3/14.

Link to invocation

@chang-l
Collaborator Author

chang-l commented Mar 14, 2026

/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2"

@chang-l
Collaborator Author

chang-l commented Mar 17, 2026

/bot run --extra-stage "DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2"

@tensorrt-cicd
Collaborator

PR_Github #39328 [ run ] triggered by Bot. Commit: 10019f8 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39328 [ run ] completed with state SUCCESS. Commit: 10019f8
/LLM/main/L0_MergeRequest_PR pipeline #30575 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@chang-l
Collaborator Author

chang-l commented Mar 18, 2026

/bot run --extra-stage "DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2"

@tensorrt-cicd
Collaborator

PR_Github #39383 [ run ] triggered by Bot. Commit: 10019f8 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39383 [ run ] completed with state SUCCESS. Commit: 10019f8
/LLM/main/L0_MergeRequest_PR pipeline #30624 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

MediaStorage.save_video() requires the ffmpeg CLI to encode MP4 output.
Install it via apt-get in the _visual_gen_deps test fixture so the
VBench dimension score tests can complete successfully in CI.

Signed-off-by: Chang Liu <[email protected]>
@chang-l
Collaborator Author

chang-l commented Mar 18, 2026

/bot run --extra-stage "DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2"

@tensorrt-cicd
Collaborator

PR_Github #39503 [ run ] triggered by Bot. Commit: 25401e5 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39503 [ run ] completed with state SUCCESS. Commit: 25401e5
/LLM/main/L0_MergeRequest_PR pipeline #30724 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

apt-get install without a prior update returns exit code 100 (package
not found) in CI containers with stale package lists.

Signed-off-by: Chang Liu <[email protected]>
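
A rough sketch of the install order the commit message above describes: refresh the package lists first, then install ffmpeg. The helper names and the injectable `run` and `which` parameters are hypothetical and only illustrate the idea; the real `_visual_gen_deps` fixture is not shown in this conversation.

```python
import shutil
import subprocess


def ffmpeg_install_commands():
    """Commands to run, in order: update package lists, then install."""
    return [
        ["apt-get", "update"],
        ["apt-get", "install", "-y", "ffmpeg"],
    ]


def ensure_ffmpeg(run=subprocess.check_call, which=shutil.which):
    """Install ffmpeg via apt-get if the CLI is not already on PATH.

    Returns True if an install was attempted, False if ffmpeg was found.
    """
    if which("ffmpeg"):
        return False  # Already installed: nothing to do.
    for cmd in ffmpeg_install_commands():
        run(cmd)  # check_call raises CalledProcessError on a non-zero exit.
    return True
```

Running `apt-get update` first matters because, as noted above, a container with stale package lists can fail the install step with exit code 100.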
@chang-l
Collaborator Author

chang-l commented Mar 19, 2026

/bot run --extra-stage "DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2"

…ench

Resolve waives.txt conflict: keep both TestNemotronNanoV3 waive and
the VBench test waives from main (added as temporary workaround while
the 429 fix was in progress).

Signed-off-by: Chang Liu <[email protected]>
@chang-l
Collaborator Author

chang-l commented Mar 19, 2026

/bot run --extra-stage "DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2"

@tensorrt-cicd
Collaborator

PR_Github #39539 [ run ] triggered by Bot. Commit: e3dd019 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39539 [ run ] completed with state SUCCESS. Commit: e3dd019
/LLM/main/L0_MergeRequest_PR pipeline #30758 completed with status: 'SUCCESS'

CI Report

Link to invocation

Comment thread tests/integration/test_lists/waives.txt Outdated
Comment thread tests/integration/defs/examples/test_visual_gen.py
Remove the 5 test_vbench_dimension_score_* waives (NVBug 5961414) since
this PR fixes the underlying 429 rate-limit and ffmpeg issues that
caused them to fail.

Signed-off-by: Chang Liu <[email protected]>
@chang-l
Collaborator Author

chang-l commented Mar 20, 2026

/bot run --extra-stage "DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2"

This waive was accidentally added during merge conflict resolution
and does not belong in this PR.

Signed-off-by: Chang Liu <[email protected]>
@chang-l
Collaborator Author

chang-l commented Mar 20, 2026

/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2"

@tensorrt-cicd
Collaborator

PR_Github #39654 [ run ] triggered by Bot. Commit: 262dc4b Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39658 [ run ] triggered by Bot. Commit: 262dc4b Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39658 [ run ] completed with state SUCCESS. Commit: 262dc4b
/LLM/main/L0_MergeRequest_PR pipeline #30863 completed with status: 'SUCCESS'

CI Report

Link to invocation

@chang-l chang-l merged commit 98c7683 into NVIDIA:main Mar 20, 2026
5 checks passed
@chang-l
Collaborator Author

chang-l commented Mar 20, 2026

Also, @yibinl-nvidia, it seems that in your PR #12009 the subpath ltx-video-2-0.9.7

LTX2_MODEL_SUBPATH = "ltx-video-2-0.9.7"

doesn't match the actual CI model directory name LTX-2. Could you fix this in a follow-up PR? Right now the LTX2 tests are being skipped because the checkpoint can’t be found.

@yibinl-nvidia
Collaborator

> Also, @yibinl-nvidia, it seems that in your PR #12009 the subpath ltx-video-2-0.9.7
>
> LTX2_MODEL_SUBPATH = "ltx-video-2-0.9.7"
>
> doesn't match the actual CI model directory name LTX-2. Could you fix this in a follow-up PR? Right now the LTX2 tests are being skipped because the checkpoint can’t be found.

Thanks, Chang, for flagging this. It should be fixed by #12463.

longcheng-nv pushed a commit to longcheng-nv/TensorRT-LLM that referenced this pull request Mar 31, 2026
…o avoid VBench 429 errors (NVIDIA#12127)

Signed-off-by: Chang Liu <[email protected]>
Signed-off-by: Chang Liu <[email protected]>
Signed-off-by: Shreyas Misra <[email protected]>
Co-authored-by: Shreyas Misra <[email protected]>