[None][feat] Move DWDP contention optimization into DwdpConfig#12974
Conversation
Move DWDP contention optimization from environment variable (TRTM_dwdp_contention_opt) to a proper DwdpConfig field. When enabled, uses cudaMemcpyBatchAsync with round-robin slice ordering and 2 MiB slice granularity to reduce NVLink contention during expert prefetch. Signed-off-by: jintaop <[email protected]> Signed-off-by: JtaoPeng <[email protected]>
Signed-off-by: jintaop <[email protected]> Signed-off-by: JtaoPeng <[email protected]>
|
/bot run --disable-fail-fast |
📝 WalkthroughWalkthroughThis PR introduces a batched memory copy optimization for the DWDP module. A new configuration option Changes
Sequence Diagram(s)sequenceDiagram
participant DM as DwdpManager
participant CUDA as CUDA Stream
participant MEM as GPU Memory
rect rgba(100, 150, 200, 0.5)
Note over DM,MEM: Baseline Path (use_batched_prefetch=False)
DM->>DM: prefetch_layer()
DM->>DM: _prefetch_layer_default()
loop For each remote peer
DM->>CUDA: cudaMemcpyAsync(full region)
CUDA->>MEM: Copy entire num_prefetch_experts
end
CUDA->>CUDA: Synchronize
end
rect rgba(150, 200, 100, 0.5)
Note over DM,MEM: Optimized Path (use_batched_prefetch=True)
DM->>DM: prefetch_layer()
DM->>DM: _build_batched_prefetch_copy_plan()
DM->>DM: Split copies into MEMCPY_BATCH_SLICE_BYTES slices
DM->>DM: Interleave slices round-robin across peers
DM->>CUDA: cudaMemcpyBatchAsync(assembled pointers/sizes)
CUDA->>MEM: Copy interleaved slices in batched call
CUDA->>CUDA: Synchronize
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
tensorrt_llm/_torch/pyexecutor/dwdp.py (2)
265-269: Redundant global reassignment.The
MEMCPY_BATCH_SLICE_BYTESglobal is already initialized to1 << 21at module level (line 25). Reassigning it to the same value whencontention_optis enabled has no effect. Consider removing this block unless the intent is to allow configuration of different slice sizes in the future.♻️ Suggested simplification
self.use_batched_prefetch = config.contention_opt - - if config.contention_opt: - global MEMCPY_BATCH_SLICE_BYTES - MEMCPY_BATCH_SLICE_BYTES = 1 << 21🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/dwdp.py` around lines 265 - 269, The global reassignment of MEMCPY_BATCH_SLICE_BYTES inside the dwdp.PyExecutor constructor is redundant because MEMCPY_BATCH_SLICE_BYTES is already initialized to 1 << 21 at module level; remove the block that sets MEMCPY_BATCH_SLICE_BYTES when config.contention_opt is true (the lines referencing global MEMCPY_BATCH_SLICE_BYTES and setting it to 1 << 21) or, if you intend to allow a different slice size via config, replace that block with reading a new config property (e.g., config.memcpy_batch_slice_bytes) and assign it to MEMCPY_BATCH_SLICE_BYTES accordingly, ensuring you reference the same global symbol in PyExecutor and document the config option.
620-621: Inconsistent None check between prefetch paths.The batched path (
_prefetch_layer_batched→_build_batched_prefetch_copy_plan) includes a defensive None check fordst_tensor(lines 584-587), but the default path here does not. Whiledst_tensorshouldn't be None forpeer_rank != self.dwdp_rankbased on buffer initialization logic, consider adding the same guard for consistency and defensive coding.🛡️ Suggested defensive check
dst_tensor = self.prefetch_buffer.buffers[buffer_idx][param_name][peer_rank] + if dst_tensor is None: + raise RuntimeError( + f"Missing prefetch buffer for peer_rank={peer_rank} param={param_name}" + ) dst_ptr = dst_tensor.data_ptr()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/dwdp.py` around lines 620 - 621, Add the same defensive None check used in the batched prefetch path here: when retrieving dst_tensor via self.prefetch_buffer.buffers[buffer_idx][param_name][peer_rank] inside the non-batched prefetch flow, verify dst_tensor is not None before calling dst_tensor.data_ptr(); if it is None, skip this copy/continue (or handle the missing buffer the same way _prefetch_layer_batched/_build_batched_prefetch_copy_plan do). This change should reference dst_tensor, buffer_idx, param_name, peer_rank and keep behavior consistent with _prefetch_layer_batched/_build_batched_prefetch_copy_plan.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/dwdp.py`:
- Around line 651-660: The call to cudart.cudaMemcpyBatchAsync requires CUDA
12.8+; update the code around the cudaMemcpyBatchAsync invocation (reference
cudart.cudaMemcpyBatchAsync and self.prefetch_buffer.prefetch_stream.cuda_stream
in dwdp.py) to either document this requirement in the module docstring/comments
and/or add a runtime check (use the CUDA runtime/version API such as
cudart.cudaRuntimeGetVersion or similar) before calling cudaMemcpyBatchAsync and
raise a clear, actionable error if the detected CUDA version is less than 12.8
(e.g., advise upgrading CUDA or fallback to a supported memcpy path).
---
Nitpick comments:
In `@tensorrt_llm/_torch/pyexecutor/dwdp.py`:
- Around line 265-269: The global reassignment of MEMCPY_BATCH_SLICE_BYTES
inside the dwdp.PyExecutor constructor is redundant because
MEMCPY_BATCH_SLICE_BYTES is already initialized to 1 << 21 at module level;
remove the block that sets MEMCPY_BATCH_SLICE_BYTES when config.contention_opt
is true (the lines referencing global MEMCPY_BATCH_SLICE_BYTES and setting it to
1 << 21) or, if you intend to allow a different slice size via config, replace
that block with reading a new config property (e.g.,
config.memcpy_batch_slice_bytes) and assign it to MEMCPY_BATCH_SLICE_BYTES
accordingly, ensuring you reference the same global symbol in PyExecutor and
document the config option.
- Around line 620-621: Add the same defensive None check used in the batched
prefetch path here: when retrieving dst_tensor via
self.prefetch_buffer.buffers[buffer_idx][param_name][peer_rank] inside the
non-batched prefetch flow, verify dst_tensor is not None before calling
dst_tensor.data_ptr(); if it is None, skip this copy/continue (or handle the
missing buffer the same way
_prefetch_layer_batched/_build_batched_prefetch_copy_plan do). This change
should reference dst_tensor, buffer_idx, param_name, peer_rank and keep behavior
consistent with _prefetch_layer_batched/_build_batched_prefetch_copy_plan.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 70f5fe9a-55d6-498f-aeea-677d045b5740
📒 Files selected for processing (2)
tensorrt_llm/_torch/pyexecutor/dwdp.pytensorrt_llm/llmapi/llm_args.py
|
PR_Github #42962 [ run ] triggered by Bot. Commit: |
|
PR_Github #42962 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #43135 [ run ] triggered by Bot. Commit: |
Superjomn
left a comment
There was a problem hiding this comment.
LGTM on the llmapi changes.
|
PR_Github #43135 [ run ] completed with state
|
…tention_opt The default value is already set to 2 MiB at module level, making this override unnecessary. Signed-off-by: JtaoPeng <[email protected]>
|
/bot run --disable-fail-fast |
|
PR_Github #43213 [ run ] triggered by Bot. Commit: |
|
PR_Github #43213 [ run ] completed with state |
|
/bot run --disable-fail-fast |
|
PR_Github #43255 [ run ] triggered by Bot. Commit: |
|
PR_Github #43255 [ run ] completed with state |
|
/bot run --disable-fail-fast --reuse-test |
|
PR_Github #44190 [ run ] triggered by Bot. Commit: |
|
PR_Github #44190 [ run ] completed with state
|
|
/bot run --disable-fail-fast --reuse-test |
|
PR_Github #44237 [ run ] triggered by Bot. Commit: |
|
PR_Github #44237 [ run ] completed with state
|
|
/bot run --disable-fail-fast --reuse-test |
|
PR_Github #44349 [ run ] triggered by Bot. Commit: |
|
PR_Github #44349 [ run ] completed with state
|
|
/bot run --disable-fail-fast --reuse-test |
|
PR_Github #44415 [ run ] triggered by Bot. Commit: |
|
PR_Github #44415 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #44653 [ run ] triggered by Bot. Commit: |
|
/bot run --disable-fail-fast |
|
PR_Github #44688 [ run ] triggered by Bot. Commit: |
|
/bot run --disable-fail-fast --reuse-test |
|
PR_Github #44901 [ run ] triggered by Bot. Commit: |
|
/bot run --disable-fail-fast |
|
PR_Github #45048 [ run ] triggered by Bot. Commit: |
|
/bot run --disable-fail-fast |
|
PR_Github #45214 [ run ] triggered by Bot. Commit: |
|
PR_Github #45214 [ run ] completed with state |
…A#12974) Signed-off-by: jintaop <[email protected]> Signed-off-by: JtaoPeng <[email protected]>
Re-implements the DWDP IPC-era contention optimization (PR NVIDIA#12974) on top of the VA-based ``DWDPWeightManager.prefetch_layer``. The ``DwdpConfig.contention_opt`` field that PR NVIDIA#12974 added is already carried through the schema; this commit makes it functional again on the VA path. When ``contention_opt=True``, ``prefetch_layer`` issues a single ``cudaMemcpyBatchAsync`` per layer with sub-slices interleaved across peers in 2 MiB chunks (``MEMCPY_BATCH_SLICE_BYTES``), so that any moment in flight touches every peer's NVLink link rather than saturating one. When ``contention_opt=False`` (default) the existing per-slice ``torch.Tensor.copy_`` path is unchanged. Adaptations from the IPC original onto the VA composite-VA structure: * Plan construction walks the same ``get_remote_slices`` / per-peer chunking that ``_prefetch_layer_per_slice`` uses (so Mode B's non-uniform / overlapping ranges resolved by ``lookup_owner`` interleave correctly, not just uniform partitions). * Each peer's per-layer contribution is split into 2 MiB sub-slices, then a round-robin emitter across peers concatenates them into the final ``(dst_ptrs, src_ptrs, sizes)`` triple consumed by ``cudaMemcpyBatchAsync``. * Plans are cached per ``layer_idx`` in ``self._batched_copy_plans`` since peer ``data_ptr()`` and destination slice ``data_ptr()`` are stable after ``setup_dwdp`` returns; this avoids host-side rebuild on every prefetch and matches the cache discipline used by PR NVIDIA#12974. Plumbing: * ``DwdpConfig.contention_opt`` -> ``DwdpManager.setup()`` -> ``setup_dwdp(contention_opt=...)`` -> ``DWDPWeightManager(contention_opt=...)``. * ``DWDPWeightManager.__slots__`` extended with ``_use_batched_prefetch`` and ``_batched_copy_plans``. Signed-off-by: tianyuz-nv <[email protected]>
Move DWDP contention optimization from environment variable (TRTM_dwdp_contention_opt) to a proper DwdpConfig field. When enabled, uses cudaMemcpyBatchAsync with round-robin slice ordering and 2 MiB slice granularity to reduce NVLink contention during expert prefetch.
Summary by CodeRabbit
Release Notes
New Features
contention_optconfiguration option to DWDP settings for memory optimization control.Performance
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.