Skip to content

[None][chore] Add disagg local one-step run script for CI submit#14557

Merged
fredricz-20070104 merged 8 commits into
NVIDIA:mainfrom
fredricz-20070104:feature/add_disagg_local_run
May 26, 2026
Merged

[None][chore] Add disagg local one-step run script for CI submit#14557
fredricz-20070104 merged 8 commits into
NVIDIA:mainfrom
fredricz-20070104:feature/add_disagg_local_run

Conversation

@fredricz-20070104
Copy link
Copy Markdown
Collaborator

@fredricz-20070104 fredricz-20070104 commented May 26, 2026

Summary by CodeRabbit

  • New Features

    • Added local disaggregated performance test runner enabling automated testing on SLURM login nodes.
    • Added example configuration template for test customization (Docker image, SLURM settings, test selection, paths).
  • Documentation

    • Expanded README with comprehensive guide covering test execution, configuration variables, multi-test workflows, and feature flags.

Review Change Stack

Add disagg local one-step run script for CI submit.

Comment thread jenkins/scripts/perf/local/run_disagg.sh
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 26, 2026

📝 Walkthrough

Walkthrough

This PR introduces a new local performance testing entrypoint for disaggregated model testing. The change adds documentation, an example configuration template, and a Bash orchestration script that automates running sanity tests on SLURM login nodes without requiring Jenkins.

Changes

Local Performance Test Runner

Layer / File(s) Summary
Local test runner documentation
jenkins/scripts/perf/README.md
README updated with directory overview of new local/run_disagg.sh wrapper and configs/ directory, plus a detailed section covering quick-start instructions, config semantics (sourcing and variable precedence), comprehensive variable reference tables (paths, SLURM, Docker image selection, test identification, installation modes, feature flags, SBATCH option stripping), multi-test work directory layout, and exit behavior on failures.
Configuration schema and example template
jenkins/scripts/perf/local/configs/example.conf
Example Bash config file defining runtime variables: TensorRT-LLM checkout and work directory paths, SLURM partition/account/job naming/time limits, Docker image selection (pinned URL or auto-resolution via image_var from CI properties), model weights path and container bind mounts, test array format and examples, installation mode and wheel path, optional feature flags, and SBATCH option stripping list for cluster compatibility.
Local test orchestration script
jenkins/scripts/perf/local/run_disagg.sh
Bash script that parses -c/--config argument, sources the config file, validates required variables and prerequisites (trtllm directory, wheel path for wheel mode), resolves Docker image (from config or by parsing current_image_tags.properties), creates per-test work directories and job names, calls submit.py to generate per-test SLURM launch scripts, strips unsupported SBATCH directives per config, submits via sbatch, and aggregates submission results with non-zero exit if any tests fail.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description is minimal and lacks required sections from the template, including detailed explanation of the issue/solution, test coverage information, and PR checklist verification. Expand the description to include: what problem the script solves, how it improves the CI workflow, which tests validate it, and confirmation of the PR checklist items.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: adding a disaggregated local one-step run script for CI submission, which matches the primary additions in the changeset.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@jenkins/scripts/perf/local/configs/example.conf`:
- Line 16: The sample config uses
partition="${YOUR_SLURM_PARTITION:-your_partition}" which doesn't match the
runner's preflight check that only rejects the explicit placeholder "CHANGE_ME",
allowing an unedited sample to pass; update the placeholder to use "CHANGE_ME"
(or add "CHANGE_ME" as the default) so the runner's validation in run_disagg.sh
will catch unmodified configs—locate the partition setting in example.conf and
change the default placeholder to "CHANGE_ME" (or make the variable include
"CHANGE_ME") to align with run_disagg.sh's validation.

In `@jenkins/scripts/perf/local/run_disagg.sh`:
- Around line 22-25: The -c|--config) branch in run_disagg.sh blindly assigns
config_file="$2" and shift 2, which will fail if -c is the last arg or the next
token is another option; update that branch (the -c|--config) case) to validate
the value before assigning: check that "$2" is non-empty and does not start with
'-' (or otherwise looks like an option), and if the check fails print a clear
usage/error message and exit non‑zero; only then set config_file="$2" and shift
2.
- Around line 96-99: Add a preflight existence check for the llm_models_path
variable similar to the trtllm check: verify that $llm_models_path is set and
points to an existing directory, print a clear error to stderr (e.g., "ERROR:
llm_models_path directory not found: $llm_models_path") and exit non-zero if it
fails; place this check before any submission or runtime steps that assume
llm_models_path is present so failures occur early (mirror the pattern used for
the trtllm directory check).
- Around line 100-110: Add an explicit validation for the install_mode variable
before the existing wheel-specific block: check that install_mode is either
"wheel" or "source", and if not print a clear error to stderr and exit non-zero;
keep referencing the same variables used in the diff (install_mode and
wheel_path) and place this check prior to the existing if [[ "$install_mode" ==
"wheel" ]] block so invalid values fail early.

In `@jenkins/scripts/perf/README.md`:
- Around line 198-203: The Markdown table starting with "| Cluster |
`strip_sbatch_opts` |" needs surrounding blank lines to satisfy markdownlint
MD058; add one blank line immediately before the line "| Cluster |
`strip_sbatch_opts` |" and one blank line immediately after the final table row
("| B200 / GB200 modern clusters | \"--segment\" (or `\"\"` if even `--segment`
works) |") so the table is separated from adjacent paragraphs.
- Around line 149-151: Update the README fenced code blocks to include language
identifiers (e.g., "text") so markdownlint MD040 is satisfied: locate the two
fenced blocks showing the pytest param string
"perf/test_perf_sanity.py::test_e2e[<runtime>-<mode>-<yaml-stem>[-<server-cfg>]]"
and the directory tree starting with "$work_dir/" and change their opening
fences from ``` to ```text (and similarly add language tags for any other code
fences in the same section referenced around lines 208-220).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4325f19a-9d2d-429f-96ce-a36e72298726

📥 Commits

Reviewing files that changed from the base of the PR and between 92ac6fc and fc9b963.

📒 Files selected for processing (4)
  • .gitignore
  • jenkins/scripts/perf/README.md
  • jenkins/scripts/perf/local/configs/example.conf
  • jenkins/scripts/perf/local/run_disagg.sh

Comment thread jenkins/scripts/perf/local/configs/example.conf
Comment thread jenkins/scripts/perf/local/run_disagg.sh
Comment thread jenkins/scripts/perf/local/run_disagg.sh
Comment thread jenkins/scripts/perf/local/run_disagg.sh
Comment thread jenkins/scripts/perf/README.md
Comment thread jenkins/scripts/perf/README.md
@fredricz-20070104
Copy link
Copy Markdown
Collaborator Author

/bot skip --comment "skip test as just add ci local submit mode"

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #50316 [ skip ] triggered by Bot. Commit: 3f3daf3 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #50316 [ skip ] completed with state SUCCESS. Commit: 3f3daf3
Skipping testing for commit 3f3daf3

Link to invocation

@fredricz-20070104 fredricz-20070104 merged commit 17cf970 into NVIDIA:main May 26, 2026
7 checks passed
bmarimuthu-nv pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request May 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants