[None][fix] Isolate ray tests to avoid GCS timeout in one pytest session #14342

Merged
brb-nv merged 1 commit into NVIDIA:main from shuyixiong:user/shuyix/isolate_ray_tests2
May 21, 2026

@shuyixiong
Collaborator

@shuyixiong shuyixiong commented May 20, 2026

Summary by CodeRabbit

  • Tests
    • Reorganized multi-GPU test execution into partitioned runs for improved test distribution.
    • Expanded GPU operation test coverage with additional test configurations.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@coderabbitai
Contributor

coderabbitai Bot commented May 20, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a74e8970-6426-4167-b5ae-bbde7774fb85

📥 Commits

Reviewing files that changed from the base of the PR and between aac0d65 and 3f5e425.

📒 Files selected for processing (4)
  • tests/integration/test_lists/test-db/l0_dgx_b200.yml
  • tests/integration/test_lists/test-db/l0_dgx_h100.yml
  • tests/unittest/_torch/ray_orchestrator/multi_gpu/test_llm_update_weights_multi_gpu.py
  • tests/unittest/_torch/ray_orchestrator/multi_gpu/test_ops.py

📝 Walkthrough

This PR introduces test partitioning for Ray-orchestrated multi-GPU operations. Pytest part markers divide update-weights and collective-ops tests into independent numbered segments, allowing the integration test harness to invoke each partition separately rather than running full test modules, improving test parallelization on GPU hardware.
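As a rough illustration of the marker scheme this walkthrough describes, the sketch below tags each test with its own `partN` marker so that `pytest -m partN` selects exactly one test. The test names and bodies are hypothetical placeholders, not the actual tests in `test_ops.py`, and the helper `marker_names` exists only for this illustration.

```python
import pytest


# Hypothetical stand-ins for the real Ray multi-GPU tests; the bodies are
# placeholders so the sketch stays self-contained.
@pytest.mark.part0
def test_allgather_sketch():
    assert sorted([2, 1]) == [1, 2]


@pytest.mark.part1
def test_allreduce_sketch():
    assert sum([1, 2, 3]) == 6


def marker_names(test_func):
    """Return the names of the pytest markers attached to a test function."""
    return [mark.name for mark in getattr(test_func, "pytestmark", [])]
```

Running `pytest <module> -m part0` against such a module would collect only `test_allgather_sketch`. Custom markers like `part0` normally need to be registered in the pytest configuration to avoid unknown-marker warnings; whether the repository already registers them is not shown in this conversation.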

Changes

Test partitioning for parallel GPU test execution

| Layer / File(s) | Summary |
| --- | --- |
| Update-weights test partitioning: tests/unittest/_torch/ray_orchestrator/multi_gpu/test_llm_update_weights_multi_gpu.py | Four FP8 and NVFP4 weight-update tests receive @pytest.mark.part0 through part3 decorators to partition execution into separate invocable test segments. |
| Ray ops test partitioning: tests/unittest/_torch/ray_orchestrator/multi_gpu/test_ops.py | Five Ray multi-GPU collective operation tests transition from @pytest.mark.gpu2 to @pytest.mark.part0 through part4, enabling independent scheduling for allgather, reducescatter, allreduce, and broadcast operations. |
| Integration test list updates: tests/integration/test_lists/test-db/l0_dgx_b200.yml, tests/integration/test_lists/test-db/l0_dgx_h100.yml | YAML test configurations replace single test entries with four partitioned update-weights invocations and add five partitioned ops test entries, wiring part markers into the l0_dgx_b200 and l0_dgx_h100 test matrices. |
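The test-list wiring summarized here amounts to launching one pytest process per partition instead of one process for the whole module, which appears to be what the PR title means by avoiding a GCS timeout within one pytest session. A minimal sketch, assuming partitions are selected with pytest's standard `-m` option; the helper names `build_partition_commands` and `run_partitions` are hypothetical, and the exact entry syntax used in the `l0_dgx_*.yml` files is not shown in this conversation.

```python
import subprocess
import sys


def build_partition_commands(test_path, num_parts):
    """Build one pytest command line per partition marker (part0..partN-1)."""
    return [
        [sys.executable, "-m", "pytest", test_path, "-m", f"part{index}"]
        for index in range(num_parts)
    ]


def run_partitions(test_path, num_parts):
    """Run each partition in its own process and return the exit codes.

    Separate processes mean in-process state does not carry over from one
    partition to the next, unlike a single long-lived pytest session.
    """
    return [
        subprocess.run(command, check=False).returncode
        for command in build_partition_commands(test_path, num_parts)
    ]
```

For the five ops tests, `build_partition_commands("tests/unittest/_torch/ray_orchestrator/multi_gpu/test_ops.py", 5)` would yield commands ending in `-m part0` through `-m part4`.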

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description is incomplete; it only contains the repository template with placeholder sections and no actual details about the issue, solution, or test coverage. | Add specific details under 'Description' and 'Test Coverage' sections explaining the GCS timeout issue, why test isolation solves it, and which test lists and markers ensure the fix works. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly indicates the main change: isolating Ray tests to avoid GCS timeouts in a single pytest session, matching the changeset modifications. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

@shuyixiong
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #49432 [ run ] triggered by Bot. Commit: 3f5e425 Link to invocation

Collaborator

@brb-nv brb-nv left a comment

LGTM.

@brb-nv brb-nv enabled auto-merge (squash) May 20, 2026 16:50
@tensorrt-cicd
Collaborator

PR_Github #49432 [ run ] completed with state SUCCESS. Commit: 3f5e425
/LLM/main/L0_MergeRequest_PR pipeline #39077 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@shuyixiong
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #49569 [ run ] triggered by Bot. Commit: 3f5e425 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #49569 [ run ] completed with state SUCCESS. Commit: 3f5e425
/LLM/main/L0_MergeRequest_PR pipeline #39196 completed with status: 'SUCCESS'

CI Report

Link to invocation

@brb-nv brb-nv merged commit 709e9c8 into NVIDIA:main May 21, 2026
14 of 15 checks passed
xxi-nv pushed a commit to xxi-nv/TensorRT-LLM that referenced this pull request May 22, 2026
KleinBlueC pushed a commit to KleinBlueC/TensorRT-LLM that referenced this pull request May 26, 2026
bmarimuthu-nv pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request May 28, 2026
Labels

None yet

Projects

None yet

3 participants