[None][fix] Isolate ray tests to avoid GCS timeout in one pytest session by shuyixiong · Pull Request #14342 · NVIDIA/TensorRT-LLM

shuyixiong · 2026-05-20T06:17:10Z

Summary by CodeRabbit

Tests
- Reorganized multi-GPU test execution into partitioned runs for improved test distribution.
- Expanded GPU operation test coverage with additional test configurations.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

- PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
- PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
- Test cases are provided for new code paths (see test instructions).
- If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
- Any new dependencies have been scanned for license and vulnerabilities.
- CODEOWNERS updated if ownership changes.
- Documentation updated as needed.
- Update tava architecture diagram if there is a significant design change in PR.
- The reviewers assigned automatically/manually are appropriate for the PR.
- Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: shuyixiong <[email protected]>

coderabbitai · 2026-05-20T06:21:35Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a74e8970-6426-4167-b5ae-bbde7774fb85

📥 Commits

Reviewing files that changed from the base of the PR and between aac0d65 and 3f5e425.

📒 Files selected for processing (4)

tests/integration/test_lists/test-db/l0_dgx_b200.yml
tests/integration/test_lists/test-db/l0_dgx_h100.yml
tests/unittest/_torch/ray_orchestrator/multi_gpu/test_llm_update_weights_multi_gpu.py
tests/unittest/_torch/ray_orchestrator/multi_gpu/test_ops.py

📝 Walkthrough

Walkthrough

This PR introduces test partitioning for Ray-orchestrated multi-GPU operations. Pytest `partN` markers divide the update-weights and collective-ops tests into independently numbered segments, so the integration test harness can invoke each partition separately instead of running a full test module in one session, which improves test parallelization on GPU hardware.
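As a rough illustration of the marker scheme described above, the sketch below tags each test with its own `partN` marker so that a marker expression can select one test per pytest session. The test names and bodies are hypothetical placeholders, not the tests changed in this PR, and `partition_of` is an illustrative helper; only the `part0`, `part1`, ... marker names come from the walkthrough.

```python
import pytest


# Hypothetical stand-ins for the Ray collective-op tests; bodies are placeholders.
@pytest.mark.part0
def test_allgather_placeholder():
    assert sorted([1, 0]) == [0, 1]


@pytest.mark.part1
def test_allreduce_placeholder():
    assert sum([1, 2]) == 3


def partition_of(test_func):
    """Return the names of the partN markers attached to a test function."""
    marks = getattr(test_func, "pytestmark", [])
    return [m.name for m in marks if m.name.startswith("part")]
```

With this layout, `pytest <test_file> -m part0` would collect only the first test. Custom markers usually need to be registered (for example under `markers` in the pytest configuration) to avoid unknown-marker warnings; whether and where this repository registers them is not shown in the PR page.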

Changes

Test partitioning for parallel GPU test execution

| Layer / File(s) | Summary |
| --- | --- |
| Update-weights test partitioning: `tests/unittest/_torch/ray_orchestrator/multi_gpu/test_llm_update_weights_multi_gpu.py` | Four FP8 and NVFP4 weight-update tests receive `@pytest.mark.part0` through `part3` decorators to partition execution into separate invocable test segments. |
| Ray ops test partitioning: `tests/unittest/_torch/ray_orchestrator/multi_gpu/test_ops.py` | Five Ray multi-GPU collective operation tests transition from `@pytest.mark.gpu2` to `@pytest.mark.part0` through `part4`, enabling independent scheduling for allgather, reducescatter, allreduce, and broadcast operations. |
| Integration test list updates: `tests/integration/test_lists/test-db/l0_dgx_b200.yml`, `tests/integration/test_lists/test-db/l0_dgx_h100.yml` | YAML test configurations replace single test entries with four partitioned update-weights invocations and add five partitioned ops test entries, wiring part markers into the l0_dgx_b200 and l0_dgx_h100 test matrices. |
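One way to picture the effect of the partitioned invocations in the table above: each partition runs in its own pytest process, so every partition starts from a fresh interpreter rather than sharing one long-lived session. The sketch below is an assumption-laden illustration only. `build_partition_commands` and `run_partitions` are hypothetical helper names, and the actual CI harness takes its invocations from the test-db YAML lists, not from a script like this.

```python
import subprocess
import sys


def build_partition_commands(test_file, num_parts):
    """Build one pytest command line per partition marker (part0..part{N-1})."""
    return [
        [sys.executable, "-m", "pytest", test_file, "-m", f"part{i}"]
        for i in range(num_parts)
    ]


def run_partitions(test_file, num_parts):
    """Run each partition in a fresh interpreter; return the failing markers."""
    failed = []
    for cmd in build_partition_commands(test_file, num_parts):
        # A new process per partition means no interpreter state is shared
        # between pytest sessions.
        result = subprocess.run(cmd)
        if result.returncode != 0:
            failed.append(cmd[-1])  # last argument is the marker name
    return failed
```

For the ops tests, a call such as `run_partitions("tests/unittest/_torch/ray_orchestrator/multi_gpu/test_ops.py", 5)` would launch five sessions, one each for `part0` through `part4`.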

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description is incomplete; it only contains the repository template with placeholder sections and no actual details about the issue, solution, or test coverage. | Add specific details under 'Description' and 'Test Coverage' sections explaining the GCS timeout issue, why test isolation solves it, and which test lists and markers ensure the fix works. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly indicates the main change: isolating Ray tests to avoid GCS timeouts in a single pytest session, matching the changeset modifications. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

shuyixiong · 2026-05-20T12:56:41Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-20T13:02:31Z

PR_Github #49432 [ run ] triggered by Bot. Commit: 3f5e425 Link to invocation

brb-nv

LGTM.

tensorrt-cicd · 2026-05-20T23:14:16Z

PR_Github #49432 [ run ] completed with state SUCCESS. Commit: 3f5e425
/LLM/main/L0_MergeRequest_PR pipeline #39077 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

shuyixiong · 2026-05-21T03:38:55Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-21T03:46:12Z

PR_Github #49569 [ run ] triggered by Bot. Commit: 3f5e425 Link to invocation

tensorrt-cicd · 2026-05-21T12:23:50Z

PR_Github #49569 [ run ] completed with state SUCCESS. Commit: 3f5e425
/LLM/main/L0_MergeRequest_PR pipeline #39196 completed with status: 'SUCCESS'

CI Report

Link to invocation

…ion (NVIDIA#14342) Signed-off-by: shuyixiong <[email protected]>

Isolate ray tests to avoid GCS timeout in one pytest session

3f5e425

Signed-off-by: shuyixiong <[email protected]>

github-actions Bot assigned shuyixiong May 20, 2026

brb-nv approved these changes May 20, 2026


brb-nv enabled auto-merge (squash) May 20, 2026 16:50

brb-nv merged commit 709e9c8 into NVIDIA:main May 21, 2026
14 of 15 checks passed

xxi-nv pushed a commit to xxi-nv/TensorRT-LLM that referenced this pull request May 22, 2026

[None][fix] Isolate ray tests to avoid GCS timeout in one pytest sess…

c459096

…ion (NVIDIA#14342) Signed-off-by: shuyixiong <[email protected]>

KleinBlueC pushed a commit to KleinBlueC/TensorRT-LLM that referenced this pull request May 26, 2026

[None][fix] Isolate ray tests to avoid GCS timeout in one pytest sess…

d217881

…ion (NVIDIA#14342) Signed-off-by: shuyixiong <[email protected]>

bmarimuthu-nv pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request May 28, 2026

[None][fix] Isolate ray tests to avoid GCS timeout in one pytest sess…

06ada3e

…ion (NVIDIA#14342) Signed-off-by: shuyixiong <[email protected]>

brb-nv merged 1 commit into NVIDIA:main from shuyixiong:user/shuyix/isolate_ray_tests2
