Skip to content

[None][feat] Move DWDP contention optimization into DwdpConfig#12974

Merged
JintaoPengCS merged 12 commits into
NVIDIA:mainfrom
JintaoPengCS:dwdp_production_pr2
Apr 23, 2026
Merged

[None][feat] Move DWDP contention optimization into DwdpConfig#12974
JintaoPengCS merged 12 commits into
NVIDIA:mainfrom
JintaoPengCS:dwdp_production_pr2

Conversation

@JintaoPengCS
Copy link
Copy Markdown
Collaborator

@JintaoPengCS JintaoPengCS commented Apr 13, 2026

Move DWDP contention optimization from environment variable (TRTM_dwdp_contention_opt) to a proper DwdpConfig field. When enabled, uses cudaMemcpyBatchAsync with round-robin slice ordering and 2 MiB slice granularity to reduce NVLink contention during expert prefetch.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added contention_opt configuration option to DWDP settings for memory optimization control.
  • Performance

    • Enhanced prefetch operations with optimized memory copying mechanism to reduce NVLink contention when enabled.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Move DWDP contention optimization from environment variable
(TRTM_dwdp_contention_opt) to a proper DwdpConfig field. When enabled,
uses cudaMemcpyBatchAsync with round-robin slice ordering and 2 MiB
slice granularity to reduce NVLink contention during expert prefetch.

Signed-off-by: jintaop <[email protected]>
Signed-off-by: JtaoPeng <[email protected]>
@JintaoPengCS JintaoPengCS changed the title [None][feat] Add contention_opt to DwdpConfig for batched prefetch **[JIRA ticket/NVBugs ID/GitHub issue/None] @coderabbitai title** Apr 13, 2026
@JintaoPengCS JintaoPengCS changed the title **[JIRA ticket/NVBugs ID/GitHub issue/None] @coderabbitai title** [None][feat] Move DWDP contention optimization into DwdpConfig Apr 13, 2026
@JintaoPengCS JintaoPengCS marked this pull request as ready for review April 13, 2026 05:50
@JintaoPengCS JintaoPengCS requested review from a team as code owners April 13, 2026 05:50
@JintaoPengCS JintaoPengCS requested a review from syuoni April 13, 2026 05:50
@JintaoPengCS
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 13, 2026

📝 Walkthrough

Walkthrough

This PR introduces a batched memory copy optimization for the DWDP module. A new configuration option contention_opt enables an alternative prefetch mechanism that batches multiple remote memory copies into a single cudaMemcpyBatchAsync call with round-robin slice interleaving, replacing individual cudaMemcpyAsync operations per peer.

Changes

Cohort / File(s) Summary
Configuration
tensorrt_llm/llmapi/llm_args.py
Added contention_opt boolean field to DwdpConfig (default False) to enable limited-contention prefetch optimization with 2 MiB batched slice granularity.
Batched Prefetch Implementation
tensorrt_llm/_torch/pyexecutor/dwdp.py
Introduced global MEMCPY_BATCH_SLICE_BYTES constant and per-layer prefetch plan caching. Added conditional logic to select between baseline _prefetch_layer_default() (original per-peer async copies) and new _prefetch_layer_batched() (single batched async call with round-robin slice interleaving). Refactored prefetch_layer() to route based on use_batched_prefetch flag.

Sequence Diagram(s)

sequenceDiagram
    participant DM as DwdpManager
    participant CUDA as CUDA Stream
    participant MEM as GPU Memory

    rect rgba(100, 150, 200, 0.5)
    Note over DM,MEM: Baseline Path (use_batched_prefetch=False)
    DM->>DM: prefetch_layer()
    DM->>DM: _prefetch_layer_default()
    loop For each remote peer
        DM->>CUDA: cudaMemcpyAsync(full region)
        CUDA->>MEM: Copy entire num_prefetch_experts
    end
    CUDA->>CUDA: Synchronize
    end

    rect rgba(150, 200, 100, 0.5)
    Note over DM,MEM: Optimized Path (use_batched_prefetch=True)
    DM->>DM: prefetch_layer()
    DM->>DM: _build_batched_prefetch_copy_plan()
    DM->>DM: Split copies into MEMCPY_BATCH_SLICE_BYTES slices
    DM->>DM: Interleave slices round-robin across peers
    DM->>CUDA: cudaMemcpyBatchAsync(assembled pointers/sizes)
    CUDA->>MEM: Copy interleaved slices in batched call
    CUDA->>CUDA: Synchronize
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description explains the change and motivation, but critical template sections like 'Test Coverage' and 'Description' are incomplete or only partially filled. Complete the 'Description' section with detailed explanation and fill in the 'Test Coverage' section with relevant test cases that safeguard the changes.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: moving DWDP contention optimization from environment variable to DwdpConfig field.
Docstring Coverage ✅ Passed Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
tensorrt_llm/_torch/pyexecutor/dwdp.py (2)

265-269: Redundant global reassignment.

The MEMCPY_BATCH_SLICE_BYTES global is already initialized to 1 << 21 at module level (line 25). Reassigning it to the same value when contention_opt is enabled has no effect. Consider removing this block unless the intent is to allow configuration of different slice sizes in the future.

♻️ Suggested simplification
         self.use_batched_prefetch = config.contention_opt
-
-        if config.contention_opt:
-            global MEMCPY_BATCH_SLICE_BYTES
-            MEMCPY_BATCH_SLICE_BYTES = 1 << 21
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/dwdp.py` around lines 265 - 269, The global
reassignment of MEMCPY_BATCH_SLICE_BYTES inside the dwdp.PyExecutor constructor
is redundant because MEMCPY_BATCH_SLICE_BYTES is already initialized to 1 << 21
at module level; remove the block that sets MEMCPY_BATCH_SLICE_BYTES when
config.contention_opt is true (the lines referencing global
MEMCPY_BATCH_SLICE_BYTES and setting it to 1 << 21) or, if you intend to allow a
different slice size via config, replace that block with reading a new config
property (e.g., config.memcpy_batch_slice_bytes) and assign it to
MEMCPY_BATCH_SLICE_BYTES accordingly, ensuring you reference the same global
symbol in PyExecutor and document the config option.

620-621: Inconsistent None check between prefetch paths.

The batched path (_prefetch_layer_batched_build_batched_prefetch_copy_plan) includes a defensive None check for dst_tensor (lines 584-587), but the default path here does not. While dst_tensor shouldn't be None for peer_rank != self.dwdp_rank based on buffer initialization logic, consider adding the same guard for consistency and defensive coding.

🛡️ Suggested defensive check
             dst_tensor = self.prefetch_buffer.buffers[buffer_idx][param_name][peer_rank]
+            if dst_tensor is None:
+                raise RuntimeError(
+                    f"Missing prefetch buffer for peer_rank={peer_rank} param={param_name}"
+                )
             dst_ptr = dst_tensor.data_ptr()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/dwdp.py` around lines 620 - 621, Add the same
defensive None check used in the batched prefetch path here: when retrieving
dst_tensor via self.prefetch_buffer.buffers[buffer_idx][param_name][peer_rank]
inside the non-batched prefetch flow, verify dst_tensor is not None before
calling dst_tensor.data_ptr(); if it is None, skip this copy/continue (or handle
the missing buffer the same way
_prefetch_layer_batched/_build_batched_prefetch_copy_plan do). This change
should reference dst_tensor, buffer_idx, param_name, peer_rank and keep behavior
consistent with _prefetch_layer_batched/_build_batched_prefetch_copy_plan.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/dwdp.py`:
- Around line 651-660: The call to cudart.cudaMemcpyBatchAsync requires CUDA
12.8+; update the code around the cudaMemcpyBatchAsync invocation (reference
cudart.cudaMemcpyBatchAsync and self.prefetch_buffer.prefetch_stream.cuda_stream
in dwdp.py) to either document this requirement in the module docstring/comments
and/or add a runtime check (use the CUDA runtime/version API such as
cudart.cudaRuntimeGetVersion or similar) before calling cudaMemcpyBatchAsync and
raise a clear, actionable error if the detected CUDA version is less than 12.8
(e.g., advise upgrading CUDA or fallback to a supported memcpy path).

---

Nitpick comments:
In `@tensorrt_llm/_torch/pyexecutor/dwdp.py`:
- Around line 265-269: The global reassignment of MEMCPY_BATCH_SLICE_BYTES
inside the dwdp.PyExecutor constructor is redundant because
MEMCPY_BATCH_SLICE_BYTES is already initialized to 1 << 21 at module level;
remove the block that sets MEMCPY_BATCH_SLICE_BYTES when config.contention_opt
is true (the lines referencing global MEMCPY_BATCH_SLICE_BYTES and setting it to
1 << 21) or, if you intend to allow a different slice size via config, replace
that block with reading a new config property (e.g.,
config.memcpy_batch_slice_bytes) and assign it to MEMCPY_BATCH_SLICE_BYTES
accordingly, ensuring you reference the same global symbol in PyExecutor and
document the config option.
- Around line 620-621: Add the same defensive None check used in the batched
prefetch path here: when retrieving dst_tensor via
self.prefetch_buffer.buffers[buffer_idx][param_name][peer_rank] inside the
non-batched prefetch flow, verify dst_tensor is not None before calling
dst_tensor.data_ptr(); if it is None, skip this copy/continue (or handle the
missing buffer the same way
_prefetch_layer_batched/_build_batched_prefetch_copy_plan do). This change
should reference dst_tensor, buffer_idx, param_name, peer_rank and keep behavior
consistent with _prefetch_layer_batched/_build_batched_prefetch_copy_plan.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 70f5fe9a-55d6-498f-aeea-677d045b5740

📥 Commits

Reviewing files that changed from the base of the PR and between 28cf4f5 and 657b225.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/pyexecutor/dwdp.py
  • tensorrt_llm/llmapi/llm_args.py

Comment thread tensorrt_llm/_torch/pyexecutor/dwdp.py
@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #42962 [ run ] triggered by Bot. Commit: 657b225 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #42962 [ run ] completed with state SUCCESS. Commit: 657b225
/LLM/main/L0_MergeRequest_PR pipeline #33619 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@JintaoPengCS
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43135 [ run ] triggered by Bot. Commit: 657b225 Link to invocation

Copy link
Copy Markdown
Collaborator

@Superjomn Superjomn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM on the llmapi changes.

Comment thread tensorrt_llm/_torch/pyexecutor/dwdp.py Outdated
Comment thread tensorrt_llm/llmapi/llm_args.py
@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43135 [ run ] completed with state SUCCESS. Commit: 657b225
/LLM/main/L0_MergeRequest_PR pipeline #33766 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

…tention_opt

The default value is already set to 2 MiB at module level, making this
override unnecessary.

Signed-off-by: JtaoPeng <[email protected]>
@JintaoPengCS
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43213 [ run ] triggered by Bot. Commit: cff3c6c Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

@JintaoPengCS JintaoPengCS enabled auto-merge (squash) April 14, 2026 13:17
@JintaoPengCS
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43255 [ run ] triggered by Bot. Commit: fb712d6 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

@JintaoPengCS
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast --reuse-test

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #44190 [ run ] triggered by Bot. Commit: 5008c68 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #44190 [ run ] completed with state SUCCESS. Commit: 5008c68
/LLM/main/L0_MergeRequest_PR pipeline #34616 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@JintaoPengCS
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast --reuse-test

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #44237 [ run ] triggered by Bot. Commit: 5008c68 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #44237 [ run ] completed with state FAILURE. Commit: 5008c68
/LLM/main/L0_MergeRequest_PR pipeline #34659 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@JintaoPengCS
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast --reuse-test

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #44349 [ run ] triggered by Bot. Commit: 5008c68 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #44349 [ run ] completed with state FAILURE. Commit: 5008c68
/LLM/main/L0_MergeRequest_PR pipeline #34767 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@JintaoPengCS
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast --reuse-test

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #44415 [ run ] triggered by Bot. Commit: 5008c68 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #44415 [ run ] completed with state SUCCESS. Commit: 5008c68
/LLM/main/L0_MergeRequest_PR pipeline #34827 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@JintaoPengCS
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #44653 [ run ] triggered by Bot. Commit: 81bc4e7 Link to invocation

@JintaoPengCS
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #44688 [ run ] triggered by Bot. Commit: a1f0d13 Link to invocation

@JintaoPengCS
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast --reuse-test

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #44901 [ run ] triggered by Bot. Commit: a1f0d13 Link to invocation

@JintaoPengCS
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45048 [ run ] triggered by Bot. Commit: eb8c955 Link to invocation

@JintaoPengCS
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45214 [ run ] triggered by Bot. Commit: eb8c955 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45214 [ run ] completed with state SUCCESS. Commit: eb8c955
/LLM/main/L0_MergeRequest_PR pipeline #35481 completed with status: 'SUCCESS'

CI Report

Link to invocation

@JintaoPengCS JintaoPengCS merged commit 79396d8 into NVIDIA:main Apr 23, 2026
5 checks passed
ziyixiong-nv pushed a commit to ziyixiong-nv/TensorRT-LLM that referenced this pull request Apr 24, 2026
tianyuz-nv added a commit to tianyuz-nv/TensorRT-LLM that referenced this pull request May 25, 2026
Re-implements the DWDP IPC-era contention optimization (PR NVIDIA#12974) on
top of the VA-based ``DWDPWeightManager.prefetch_layer``.  The
``DwdpConfig.contention_opt`` field that PR NVIDIA#12974 added is already
carried through the schema; this commit makes it functional again on
the VA path.

When ``contention_opt=True``, ``prefetch_layer`` issues a single
``cudaMemcpyBatchAsync`` per layer with sub-slices interleaved across
peers in 2 MiB chunks (``MEMCPY_BATCH_SLICE_BYTES``), so that any
moment in flight touches every peer's NVLink link rather than
saturating one.  When ``contention_opt=False`` (default) the existing
per-slice ``torch.Tensor.copy_`` path is unchanged.

Adaptations from the IPC original onto the VA composite-VA structure:

  * Plan construction walks the same ``get_remote_slices`` / per-peer
    chunking that ``_prefetch_layer_per_slice`` uses (so Mode B's
    non-uniform / overlapping ranges resolved by ``lookup_owner``
    interleave correctly, not just uniform partitions).
  * Each peer's per-layer contribution is split into 2 MiB sub-slices,
    then a round-robin emitter across peers concatenates them into the
    final ``(dst_ptrs, src_ptrs, sizes)`` triple consumed by
    ``cudaMemcpyBatchAsync``.
  * Plans are cached per ``layer_idx`` in ``self._batched_copy_plans``
    since peer ``data_ptr()`` and destination slice ``data_ptr()`` are
    stable after ``setup_dwdp`` returns; this avoids host-side rebuild
    on every prefetch and matches the cache discipline used by PR
    NVIDIA#12974.

Plumbing:

  * ``DwdpConfig.contention_opt`` -> ``DwdpManager.setup()`` ->
    ``setup_dwdp(contention_opt=...)`` ->
    ``DWDPWeightManager(contention_opt=...)``.
  * ``DWDPWeightManager.__slots__`` extended with
    ``_use_batched_prefetch`` and ``_batched_copy_plans``.

Signed-off-by: tianyuz-nv <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants