[https://nvbugs/6001694][fix] Add CUDA profiler API scoping for visual gen nsys profiling#12432
Conversation
📝 WalkthroughWalkthroughAdds CUDA profiling capabilities to the visual generation pipeline by introducing profiler state management, parsing environment variable configuration for profiling ranges, and instrumenting the warmup, denoise, decode, and cleanup phases to conditionally record performance metrics. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (3)
tensorrt_llm/_torch/visual_gen/pipeline.py (3)
28-30: Replace en-dash characters with standard hyphens in docstring.Static analysis flagged ambiguous "–" (en-dash) characters. Use "-" (hyphen-minus) for consistency and to avoid potential issues with ASCII-only tooling.
🔧 Proposed fix
- * ``A-B`` – profile denoise steps A through B (same ``A-B`` format as LLM path) - * ``all`` – profile the full generation forward (denoise + VAE), skip warmup - * (unset) – no profiler API calls; plain ``nsys profile`` captures everything + * ``A-B`` - profile denoise steps A through B (same ``A-B`` format as LLM path) + * ``all`` - profile the full generation forward (denoise + VAE), skip warmup + * (unset) - no profiler API calls; plain ``nsys profile`` captures everything🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/visual_gen/pipeline.py` around lines 28 - 30, Docstring uses en-dash characters (–) in the bullet list which static analysis flagged; replace them with standard hyphen-minus (-) characters. Edit the docstring in tensorrt_llm._torch.visual_gen.pipeline (the multiline comment containing the "A-B", "all" and "(unset)" bullets) and change each "–" to "-" so entries read "* ``A-B`` - profile...", "* ``all`` - profile...", "* (unset) - no profiler...". Ensure only the punctuation characters are replaced and formatting/markdown backticks remain unchanged.
41-46: Consider adding error handling for malformed input.If the user provides a malformed value like
"1-2-3"or"foo", the current code will raise an unhandledValueError. This could be confusing for users. Consider adding validation with a clearer error message, or documenting the expected failure behavior.🛡️ Optional: Add error handling
# A-B format (same parser as LLM path) if "-" in val: - start, stop = val.split("-") - return int(start), int(stop) + parts = val.split("-") + if len(parts) != 2: + raise ValueError( + f"Invalid TLLM_PROFILE_START_STOP format: '{val}'. " + "Expected 'A-B', 'all', or a single step number." + ) + try: + return int(parts[0]), int(parts[1]) + except ValueError: + raise ValueError( + f"Invalid TLLM_PROFILE_START_STOP range: '{val}'. " + "Start and stop must be integers." + ) # Single step - v = int(val) - return v, v + try: + v = int(val) + return v, v + except ValueError: + raise ValueError( + f"Invalid TLLM_PROFILE_START_STOP value: '{val}'. " + "Expected 'A-B', 'all', or a single step number." + )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/visual_gen/pipeline.py` around lines 41 - 46, The parsing of the range string in the block that reads the variable val currently assumes either "start-stop" or a single integer and will raise an unhandled ValueError for inputs like "1-2-3" or "foo"; update this code to validate the format and provide a clear error when malformed: if "-" in val, split and ensure there are exactly two parts, attempt to cast both parts to int inside a try/except and raise a ValueError with a descriptive message if casting fails or parts count != 2; for the single-value branch, wrap int(val) in try/except and raise a clear ValueError on failure. Ensure you keep the same return semantics (start, stop) on success and reference the same variable names (val, start, stop) so callers remain unchanged.
535-545: Minor style improvements suggested by static analysis.
- Line 543: Loop variable
nameis unused - rename to_name- Lines 535, 545: Use tuple unpacking instead of concatenation
♻️ Proposed style fixes
if extra_latents: extra_results = [] - for name, (extra_latent, extra_decode_fn) in extra_latents.items(): + for _name, (extra_latent, extra_decode_fn) in extra_latents.items(): extra_results.append(extra_decode_fn(extra_latent)) - result = (primary_result,) + tuple(extra_results) + result = (primary_result, *extra_results)And similarly for line 535:
if extra_latents: extra_results = [efn(elat) for _, (elat, efn) in extra_latents.items()] - result = (primary_result,) + tuple(extra_results) + result = (primary_result, *extra_results)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/visual_gen/pipeline.py` around lines 535 - 545, Rename the unused loop variable name to _name in the extra_latents loop inside the branch where self.rank == 0, and replace tuple concatenation usages with tuple unpacking for clarity: when collecting extra_results (the block that calls extra_decode_fn on extra_latent) change the loop to for _name, (extra_latent, extra_decode_fn) in extra_latents.items(), and where the code currently builds result using (primary_result,) + tuple(extra_results) (and the symmetric case above), use (primary_result, *extra_results) instead; keep the else case returning primary_result unchanged. Ensure references are to decode_fn, extra_latents, extra_results, primary_result, and result.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@tensorrt_llm/_torch/visual_gen/pipeline.py`:
- Around line 28-30: Docstring uses en-dash characters (–) in the bullet list
which static analysis flagged; replace them with standard hyphen-minus (-)
characters. Edit the docstring in tensorrt_llm._torch.visual_gen.pipeline (the
multiline comment containing the "A-B", "all" and "(unset)" bullets) and change
each "–" to "-" so entries read "* ``A-B`` - profile...", "* ``all`` -
profile...", "* (unset) - no profiler...". Ensure only the punctuation
characters are replaced and formatting/markdown backticks remain unchanged.
- Around line 41-46: The parsing of the range string in the block that reads the
variable val currently assumes either "start-stop" or a single integer and will
raise an unhandled ValueError for inputs like "1-2-3" or "foo"; update this code
to validate the format and provide a clear error when malformed: if "-" in val,
split and ensure there are exactly two parts, attempt to cast both parts to int
inside a try/except and raise a ValueError with a descriptive message if casting
fails or parts count != 2; for the single-value branch, wrap int(val) in
try/except and raise a clear ValueError on failure. Ensure you keep the same
return semantics (start, stop) on success and reference the same variable names
(val, start, stop) so callers remain unchanged.
- Around line 535-545: Rename the unused loop variable name to _name in the
extra_latents loop inside the branch where self.rank == 0, and replace tuple
concatenation usages with tuple unpacking for clarity: when collecting
extra_results (the block that calls extra_decode_fn on extra_latent) change the
loop to for _name, (extra_latent, extra_decode_fn) in extra_latents.items(), and
where the code currently builds result using (primary_result,) +
tuple(extra_results) (and the symmetric case above), use (primary_result,
*extra_results) instead; keep the else case returning primary_result unchanged.
Ensure references are to decode_fn, extra_latents, extra_results,
primary_result, and result.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a4170482-e0ff-426b-aab9-771c10375016
📒 Files selected for processing (1)
tensorrt_llm/_torch/visual_gen/pipeline.py
3d74db2 to
036b805
Compare
|
/bot run |
|
PR_Github #39842 [ run ] triggered by Bot. Commit: |
|
/bot run |
|
PR_Github #39842 [ run ] completed with state
|
|
PR_Github #39846 [ run ] triggered by Bot. Commit: |
|
/bot run |
|
PR_Github #39857 [ run ] triggered by Bot. Commit: |
|
/bot run |
|
PR_Github #39893 [ run ] triggered by Bot. Commit: |
|
PR_Github #39893 [ run ] completed with state |
ccc9333 to
d2414ab
Compare
…l gen nsys profiling
Add TLLM_PROFILE_START_STOP support to visual gen pipelines, following the
same pattern as the LLM path (PyExecutor). This enables proper nsys GPU
kernel capture during generation by scoping CUPTI instrumentation via
cudaProfilerStart/Stop.
Root cause: Without scoping, nsys/CUPTI continuously instruments the full
pipeline including warmup (79K+ kernel launches for a 14B model). The
sustained CUPTI overhead causes CUDA "Command Buffer Full" events to
accumulate, eventually degrading CUPTI's kernel activity tracing. By the
time actual generation starts, GPU kernel events are silently dropped
while CPU NVTX markers continue to work — creating the misleading profile
reported in the bug.
The fix uses the existing TLLM_PROFILE_START_STOP env var with
nsys -c cudaProfilerApi to start CUPTI fresh at generation time:
TLLM_PROFILE_START_STOP=all nsys profile -c cudaProfilerApi \
--capture-range-end=stop-shutdown ...
→ captures denoise + VAE decode, skips warmup
TLLM_PROFILE_START_STOP=0-4 nsys profile -c cudaProfilerApi \
--capture-range-end=stop ...
→ captures only denoise steps 0 through 4
Verified on B200 (single-GPU and 4-GPU cfg=2 ulysses=2) with
Wan2.2-T2V-A14B: all denoise steps + VAE decode captured with correct
GPU kernel events.
Signed-off-by: Chang Liu <[email protected]>
…STOP in visual gen profiler Update _parse_profile_range() to accept comma-separated ranges (e.g. "0-4,10-14") matching the LLM path's _load_iteration_indexes format. Returns (frozenset(starts), frozenset(stops)) instead of a single (start, stop) tuple; update the denoise loop accordingly. Signed-off-by: Chang Liu <[email protected]>
…eat:N nsys recipe Signed-off-by: Chang Liu <[email protected]>
…T_STOP, add predenoise/postdenoise modes Signed-off-by: Chang Liu <[email protected]>
30d9f1d to
3895e52
Compare
|
/bot reuse-pipeline |
|
PR_Github #46429 [ reuse-pipeline ] triggered by Bot. Commit: |
|
PR_Github #46429 [ reuse-pipeline ] completed with state |
|
/bot run |
|
PR_Github #46433 [ run ] triggered by Bot. Commit: |
|
PR_Github #46433 [ run ] completed with state
|
|
/bot run |
|
PR_Github #46662 [ run ] triggered by Bot. Commit: |
|
PR_Github #46662 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #46698 [ run ] triggered by Bot. Commit: |
|
PR_Github #46698 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
/bot run |
|
PR_Github #46745 [ run ] triggered by Bot. Commit: |
|
PR_Github #46745 [ run ] completed with state
|
|
/bot run |
|
PR_Github #46791 [ run ] triggered by Bot. Commit: |
|
PR_Github #46791 [ run ] completed with state |
…l gen nsys profiling (NVIDIA#12432) Signed-off-by: Chang Liu <[email protected]>
Summary
TLLM_PROFILE_VISUAL_GEN_START_STOPenv var support in visual genBasePipeline. Same scoping pattern as the LLM path (TLLM_PROFILE_START_STOP) but visual-gen-specific so the two backends can diverge cleanly. Use withnsys profile -c cudaProfilerApi ...to scope CUPTI instrumentation viacudaProfilerStart/cudaProfilerStop.A-Bform are per-request (eachdenoise()call resets the counter to 0). This differs from the LLM path's global executor iteration counter, where one forward pass services all in-flight requests.predenoise— single-shot capture of per-request pre-loop work insidedenoise()(CFG config setup, scheduler refresh, TeaCache reset) up to the first denoise step.postdenoise— single-shot capture from end of last denoise step to pipeline cleanup, covering VAE decode.Usage
Test plan
TLLM_PROFILE_VISUAL_GEN_START_STOP=0-1: captures only denoise steps 0-1 (5,154 kernels)TLLM_PROFILE_VISUAL_GEN_START_STOP=all: captures full generation including VAE decode (18,117 kernels)nsys profile: no behavioral change (backward compatible)trtllm-servewith--capture-range-end=repeat:5on Wan2.1-T2V-1.3B (B200): 3 sequential POSTs to/v1/videos/generationsproduced 3 separate.nsys-repfiles, each containing exactly the configureddenoise_step 0/1/2NVTX ranges — confirming step indices reset per request andrepeat:Nis the correct flag for serve-style workloads.predenoisemode undertrtllm-serveon Wan2.1-T2V-1.3B (B200): request 200 OK, captured.nsys-repcontains the per-request pre-loop CUDA work (CFG-embedCatArrayBatchedCopy) and 0denoise_stepNVTX ranges.postdenoisemode undertrtllm-serveon Wan2.1-T2V-1.3B (B200): request 200 OK, captured.nsys-repcontains 1_decode_latentsNVTX range (~0.78s) plus cuDNN VAE convolution kernels and 0denoise_stepNVTX ranges.all+stop-shutdown+python visual_gen_wan_t2v.py+-t cuda,nvtx,osrt,cudnn,cublas:.nsys-repcontains 50_scheduler_stepinstances, 50 uniquedenoise_step Nindices, and_decode_latents(~376 ms VAE).0-4+stop+python visual_gen_wan_t2v.py+-t cuda,nvtx,osrt,cudnn,cublas: script exits 0;.nsys-repcontains exactly 5_scheduler_stepinstances anddenoise_step 0/1/2/3/4(no other indices).0-4+repeat:5+trtllm-serve+-t cuda,nvtx,osrt,cudnn,cublaswith 3 sequential POSTs (44s/35s/21s, all 200 OK): produced 3 separate.nsys-repfiles; each contains exactlydenoise_step 0/1/2/3/4, confirming per-request scoping for the literal recipe.🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Chores