[TRTLLM-11457][feat] Async Ulysses pipeline (Enabled for LTX-2 + WAN) #13978
luyiyun1021 wants to merge 20 commits into
Conversation
/bot run --disable-fail-fast
PR_Github #47692 [ run ] triggered by Bot. Commit:
📝 Walkthrough
This PR implements an asynchronous split-QKV pipeline for Ulysses sequence parallelism. It adds NCCL LSA barrier support, a hybrid all-to-all v11 service combining peer copies with self-copy streaming, Green Context stream partitioning, and integrates pipelined Q/K/V pre-computation into LTX2 attention with rank synchronization via LSA barriers.
Changes
Ulysses Split-QKV LSA Barrier Pipeline
Sequence Diagram(s)
sequenceDiagram
participant Client as LTX2Attention
participant Pipeline as Ulysses Pipeline
participant VCompute as V Compute<br/>(gc_comp stream)
participant QKCompute as Q/K Compute<br/>(pri_comm stream)
participant SelfCopy as Self-Copy<br/>(gc_selfcopy stream)
participant Peers as Peer Copy<br/>(memcpy)
participant Barrier as LSA Barrier
participant Backend as Inner Backend SDPA
Client->>Pipeline: forward_with_pipeline(x, to_q, to_k, to_v, ...)
par V Stream
Pipeline->>VCompute: norm_v + to_v (5D slot layout)
VCompute-->>Pipeline: v_5d
and Q/K Stream
Pipeline->>QKCompute: norm_q/k + rope + to_q/k (5D)
QKCompute-->>Pipeline: q_5d, k_5d
end
par Self-Copy
Pipeline->>SelfCopy: self CUDA memcpy (windowed)
and Peer-Copy
Pipeline->>Peers: peer memcpy (windowed)
end
Pipeline->>Barrier: emit() on pri_comm stream
Barrier-->>Pipeline: release-ordered rank sync
Pipeline->>Backend: forward(v_5d, q_5d, k_5d) inner SDPA
Backend-->>Pipeline: sdpa_out_5d
Pipeline->>Pipeline: post_permute (5D → 4D)
Pipeline-->>Client: output tensor
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 3
🧹 Nitpick comments (1)
tensorrt_llm/_torch/distributed/_ulysses_gc.py (1)
1-2: ⚡ Quick win: Update the copyright year on this new file.
This file is newly added in 2026, so the NVIDIA header should carry the current modification year as required by the repo rules.
As per coding guidelines "All C++, Python, and other source files must contain NVIDIA copyright header with current modification year".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/distributed/_ulysses_gc.py` around lines 1 - 2, The file header comment currently shows "2025" as the copyright/modification year; update the SPDX header at the top of tensorrt_llm/_torch/distributed/_ulysses_gc.py to use the current year "2026" (both occurrences in the two header lines) so the NVIDIA copyright header matches repo rules.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cpp/tensorrt_llm/thop/alltoallOp.cpp`:
- Around line 401-418: The fast-path check of mLsaBarrier races with the
lazy-init under mLsaBarrierMutex; fix by synchronizing publication before
calling emit(): acquire mLsaBarrierMutex (or use std::call_once) to ensure
creation via tensorrt_llm::kernels::LsaBarrier::create completes and the pointer
is visible, then copy mLsaBarrier to a local variable, release the lock, check
it with TLLM_CHECK_WITH_INFO and call local->emit(stream). Ensure all uses of
mLsaBarrier->emit() use the locally-copied pointer so no emit runs on a
concurrently-written pointer.
In `@tensorrt_llm/_torch/distributed/_ulysses_gc.py`:
- Around line 264-277: The current broad "except Exception" around the
_create_two_partitions call hides unrelated bugs; replace it with catching only
the specific errors that indicate partition setup failure (e.g., RuntimeError
and any CUDA/OSError that _create_two_partitions can raise) and let all other
exceptions propagate. Concretely, change the except clause around
_create_two_partitions(...) to "except (RuntimeError, OSError) as e:" (or the
exact exception types _create_two_partitions documents), keep the warnings.warn
fallback logic, and do not swallow other exceptions so bugs in
_create_two_partitions or surrounding code fail loudly.
- Around line 291-297: The per-device singleton cache in
UlyssesPipelineStreams.get is racy; wrap the lookup/create path with a
class-level lock to ensure only one thread creates and publishes an instance for
a given device_id. Add a class attribute (e.g., _instances_lock =
threading.Lock()) on the UlyssesPipelineStreams class, initialize it alongside
_instances, and in the get(cls, device_id, ...) method acquire the lock before
checking cls._instances.get(device_id) and creating/storing a new instance, then
release the lock (use context manager) after the update so concurrent callers
cannot create duplicate instances.
---
Nitpick comments:
In `@tensorrt_llm/_torch/distributed/_ulysses_gc.py`:
- Around line 1-2: The file header comment currently shows "2025" as the
copyright/modification year; update the SPDX header at the top of
tensorrt_llm/_torch/distributed/_ulysses_gc.py to use the current year "2026"
(both occurrences in the two header lines) so the NVIDIA copyright header
matches repo rules.
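The singleton-cache finding above (`UlyssesPipelineStreams.get`) describes a check-then-create race. A minimal sketch of the suggested class-level lock follows; `PipelineStreams` is a hypothetical stand-in, not the class from `_ulysses_gc.py`, and it holds no real CUDA streams.

```python
import threading


class PipelineStreams:
    """Hypothetical stand-in for a per-device stream bundle."""

    _instances = {}                      # device_id -> PipelineStreams
    _instances_lock = threading.Lock()   # guards lookup + create + publish

    def __init__(self, device_id):
        self.device_id = device_id

    @classmethod
    def get(cls, device_id):
        # Hold the lock across the whole check-then-create sequence so two
        # threads can never both observe "missing" and build duplicates.
        with cls._instances_lock:
            inst = cls._instances.get(device_id)
            if inst is None:
                inst = cls(device_id)
                cls._instances[device_id] = inst
            return inst


def _worker(results, idx):
    results[idx] = PipelineStreams.get(0)


# Eight threads race to fetch the instance for device 0.
results = [None] * 8
threads = [threading.Thread(target=_worker, args=(results, i)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every thread ends up holding the same object, because creation and publication happen under one lock. A lock-free fast path would need extra care and is deliberately left out of this sketch.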
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 27c2f052-21cb-46e3-8eaf-41dbebc0318d
📒 Files selected for processing (8)
cpp/tensorrt_llm/kernels/lsaBarrierKernel.cu
cpp/tensorrt_llm/kernels/lsaBarrierKernel.h
cpp/tensorrt_llm/thop/alltoallOp.cpp
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
tensorrt_llm/_torch/distributed/_ulysses_gc.py
tensorrt_llm/_torch/visual_gen/attention_backend/parallel.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/utils_ltx2.py
tensorrt_llm/_torch/visual_gen/models/ltx2/transformer_ltx2.py
PR_Github #47692 [ run ] completed with state
8ab04d2 to 881d06d
a3933c6 to 6585bc6
7f6e7e6 to b5a86a9
PR_Github #52496 [ run ] triggered by Bot. Commit:
PR_Github #52468 [ run ] completed with state
zhenhuaw-me left a comment
I left some comments on the non-kernel part. Feel free to address them in follow-up PRs where applicable, so this PR can merge as-is.
status="prototype",
description=("Ulysses head-sharding degree. Heads are sharded across ulysses_size GPUs."),
)
async_ulysses: bool = Field(
I thought this feature would always be on when Ulysses is enabled, since it should not have any perf penalty. Did I miss a case? If that's right, we don't need a knob.
Feel free to resolve this comment to unblock merging. If we converge on not needing the knob, we can address it in upcoming PRs.
Flux and Cosmos are not supported yet. Different models have different attention-module computation, so we have to customize the pipeline per model. This may need some extra work, which we may leave to a future PR.
No perf penalty in our test cases, but we currently have CUDA Graph perf issues with cudaMemcpyBatchAsync. I think we should keep some flexibility in case a perf penalty shows up. What do you think?
b5a86a9 to a861c81
/bot run --disable-fail-fast
PR_Github #52513 [ run ] triggered by Bot. Commit:
PR_Github #52496 [ run ] completed with state
…ses=True conflict
Pre-fix, async_ulysses=True forces qkv_mode=SEPARATE_QKV (so V/Q/K projections can stream independently), which bypassed the FUSE_QKV-only Ring gate (`qkv_mode != SEPARATE_QKV`), so Ring was silently disabled. Even at the wrap level it would not actually work end-to-end: `UlyssesAttention.forward_async` dispatches straight to `self.inner_backend.forward_async` and never invokes the wrapped `RingAttention.forward`, so Ring would be a no-op under async even if the wrap had been allowed.
Real Ring+async support requires `RingAttention.forward_async` + pass-through in `UlyssesAttention.forward_async`, which is out of scope for this PR. For now, raise ValueError on the conflicting combo so users opting into both don't silently lose Ring.
Reported in PR review by NVShreyas.
Signed-off-by: Yiyun Lu <[email protected]>
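The fail-loudly behaviour described in this commit can be illustrated with a small validation sketch. Only `async_ulysses` and `ulysses_size` are names taken from this PR; `ParallelOptions`, `ring_size`, and `validate` are hypothetical and do not reflect the real `ParallelConfig` code.

```python
from dataclasses import dataclass


@dataclass
class ParallelOptions:
    """Hypothetical container; not the real ParallelConfig."""
    async_ulysses: bool = False
    ulysses_size: int = 1
    ring_size: int = 1  # assumed name for the Ring degree


def validate(opts: ParallelOptions) -> None:
    # Reject the combination up front instead of silently dropping Ring.
    if opts.async_ulysses and opts.ring_size > 1:
        raise ValueError(
            "async_ulysses=True cannot be combined with Ring attention "
            f"(ring_size={opts.ring_size}); disable one of them."
        )


# Async Ulysses alone is accepted.
validate(ParallelOptions(async_ulysses=True, ulysses_size=4))

# Async Ulysses together with Ring is rejected.
try:
    validate(ParallelOptions(async_ulysses=True, ulysses_size=2, ring_size=2))
    conflict_rejected = False
except ValueError:
    conflict_rejected = True
```

Raising at configuration time keeps the failure close to the user's mistake, rather than surfacing later as missing Ring behaviour.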
…IPC constraint
Documents the async Ulysses A2A pipeline (`async_ulysses` in `ParallelConfig`) added by this PR series across three places:
* docs/source/models/visual-generation.md: adds a sub-bullet under Ulysses Parallelism in the Multi-GPU section, noting the requirement on an NVLink-connected domain (PyTorch `_SymmetricMemory` + CUDA IPC for peer pushes; not currently supported across nodes without MNNVL).
* examples/visual_gen/README.md: mirrors the note under WAN's Multi-GPU Parallelism section and adds a new Multi-GPU Parallelism sub-section under LTX-2 (LTX-2 supports head-sharded Ulysses with the same divisibility constraint and benefits from async_ulysses the same way).
* examples/visual_gen/configs/wan2.2-t2v-fp4-4gpu.yaml: opts the 4-GPU WAN reference serve config into `async_ulysses: true` (NVLink domain on a single 8-GPU node, the standard target for this config).
Uses the field name as it appears in code (`async_ulysses`); reviewer flagged that the PR description still references the older `dit_async_ulysses` name.
Signed-off-by: Yiyun Lu <[email protected]>
…lysses=true
Mirrors the wan2.2-t2v-fp4-4gpu.yaml structure but for LTX-2: cfg_size=2 + ulysses_size=2 + async_ulysses=true. Precision-agnostic (no quant_config block; checkpoint precision is selected via --model_path).
Signed-off-by: Yiyun Lu <[email protected]>
… enable torch_compile + cuda_graph in LTX-2 4-GPU config
* wan2.2-t2v-fp4-4gpu.yaml and ltx2-4gpu.yaml previously carried a 3-line comment under `parallel_config` explaining the MNNVL/CUDA IPC requirement of `async_ulysses`. The same explanation already lives in docs/source/models/visual-generation.md and examples/visual_gen/README.md, so the inline yaml comment was redundant and is dropped from both.
* ltx2-4gpu.yaml: enable torch_compile and cuda_graph (`enable: true` for both). These are standard prod knobs for the 4-GPU LTX-2 path; the WAN 4-GPU config is left as-is (`cuda_graph_config.enable: false`).
Signed-off-by: Yiyun Lu <[email protected]>
Address Codex/reviewer findings on the async Ulysses A2A pipeline. No functional change to the hot path; existing ws=2 bit-exact parity tests are unaffected.
C++ transport plane (asyncUlyssesOp.cpp):
- ulysses_a2a_async_prepare: reject non-CUDA input and install c10::cuda::CUDAGuard so slot allocation (cudaGetDevice) and kernel launch (input.get_device()) bind to the same device.
- SendHandle: add group_name field; ulysses_a2a_async validates handle.group_name == pg.getGroupName() to reject cross-PG handle reuse (two PGs of the same size would otherwise pass the peer-pointer-count check and silently push into the wrong group).
- AsyncUlyssesOp::getOrAllocSlot: build new slot state in local variables; commit (move-assign + free old sendBuf) only after all four allocation steps (symm-mem empty_strided_p2p + rendezvous + get_buffer_ptrs + cudaMalloc) succeed. Prior progressive mutation left a poisoned cached slot on partial-failure paths.
C++ op bindings:
- ulyssesPostUnscatterOp.cpp: add TORCH_CHECK(D % 8 == 0) at op level (mirrors sibling ulysses_permute_scatter); early-return on empty Q/K/V to avoid zero-grid kernel launches.
- ulyssesPermuteScatterOp.cpp: add TORCH_CHECK(P > 0) before H % P to avoid division-by-zero / SIGFPE on bad schema-level callers; early-return on empty input.
Python:
- modules/attention.py::Attention.forward_async: prescriptive ValueError when self.attn lacks forward_async (caller used async_ulysses=False at init), replacing the deep AttributeError.
Signed-off-by: Yiyun Lu <[email protected]>
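The `getOrAllocSlot` change above is an instance of the build-in-locals, commit-at-the-end pattern. A language-neutral sketch in Python follows, assuming four placeholder callables in place of `empty_strided_p2p`, `rendezvous`, `get_buffer_ptrs`, and `cudaMalloc`; `Slot` and `alloc_slot` are hypothetical names, and the real code is C++.

```python
class Slot:
    """Hypothetical cached slot; fields stay None until a full commit."""

    def __init__(self):
        self.handle = None
        self.peer_ptrs = None
        self.send_buf = None


def alloc_slot(slot, steps):
    """Run every allocation step into locals; publish to `slot` only at the end."""
    alloc, rendezvous, get_ptrs, malloc = steps
    buf = alloc()
    handle = rendezvous(buf)
    peer_ptrs = get_ptrs(handle)
    send_buf = malloc()
    # All four steps succeeded: commit in one place. If any step above
    # raised, `slot` was never touched and cannot be left half-initialized.
    slot.handle, slot.peer_ptrs, slot.send_buf = handle, peer_ptrs, send_buf
    return slot


def _failing_malloc():
    raise RuntimeError("simulated allocation failure")


good_steps = [lambda: "buf", lambda b: "handle", lambda h: [1, 2], lambda: "send"]
bad_steps = good_steps[:3] + [_failing_malloc]

slot = Slot()
try:
    alloc_slot(slot, bad_steps)
except RuntimeError:
    pass
state_after_failure = (slot.handle, slot.peer_ptrs, slot.send_buf)

alloc_slot(slot, good_steps)
state_after_success = (slot.handle, slot.peer_ptrs, slot.send_buf)
```

After the simulated failure the slot is still empty, so a later retry starts from a clean state rather than from a poisoned cache entry.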
…o NHD layout
The fused post-A2A unscatter kernel previously only emitted HND
[B, H, P*Sp, D], so async Ulysses with TRTLLM / FA4 backends (which
prefer NHD) fell back to the eager 3-copy permute path. Extend the
kernel and op binding to also emit NHD [B, P*Sp, H, D] in one launch,
so all bf16 async-Ulysses callers benefit regardless of backend
layout. HND fast-path callers are unaffected (default layout=0).
Changes:
- ulyssesPostUnscatterKernel.{cu,h}: add `bool IsHnd` template arg;
if constexpr branch on out_base. Launcher dispatches two template
instances based on a new `bool is_hnd` runtime arg.
- ulyssesPostUnscatterOp.cpp: schema adds `int layout=0` (0=HND,
1=NHD); op validates layout, allocates outputs with the matching
shape, and passes is_hnd to the kernel launcher. Default keeps
backward compatibility with existing callers.
- cpp_custom_ops.py: register_fake takes the new layout=0 default
and branches the fake output shape accordingly.
- visual_gen/attention_backend/parallel.py:
- Rename helper _ulysses_post_unscatter_to_hnd to
_ulysses_post_unscatter with is_hnd kwarg.
- Rename gate flag use_fused_op to use_fused_post_unscatter and
drop the HND-only condition; the fused kernel now covers both
layouts so the gate is just bf16.
- Eager fallback (non-bf16) keeps the existing NHD/HND post-permute
logic unchanged.
- test_ulysses_post_unscatter.py: parametrize exact_match test on
layout (HND + NHD); add an explicit reject test for invalid layout
values; reference function branches on is_hnd to skip the final
transpose for NHD.
Signed-off-by: Yiyun Lu <[email protected]>
… transpose-view
cudnn SDPA preserves the input's stride pattern in its output. The sync `_forward_unfused` path passes HND-shape NHD-stride tensors (via `q.transpose(1, 2)` without `.contiguous()`), so cudnn returns HND-shape NHD-stride output and the downstream `_output_a2a.transpose(1, 2).contiguous()` collapses to a no-op.
The async `forward_async` path used `_ulysses_post_unscatter(is_hnd=True)` to allocate HND-contig storage. cudnn then returned HND-contig output, and `_output_a2a`'s transpose+contiguous required a real memcpy, observable in nsys as a 62us `triton_poi_fused_all_to_all_single_clone_permute_transpose_view_0` kernel between SDPA and reverse NCCL (absent in OFF). At WAN A14B 720x1280/81f this is ~5 ms / step of avoidable layout cost.
Make the post-unscatter kernel always write NHD-contig storage `[B, P*Sp, H, D]`; the op wrapper returns it as-is for NHD callers and as a transpose-view `[B, H, P*Sp, D]` for HND callers (HND-shape, NHD-stride, non-contig; mirrors the sync path). cudnn then preserves NHD-stride through SDPA and the post-attention contiguous() is free.
- Kernel: drop `IsHnd` template param + `if constexpr` branch, single NHD output address calculation.
- Op: always alloc NHD storage; return `.transpose(1, 2)` view for layout=0.
- Fake op: mirror via `new_empty(NHD).transpose(1, 2)` so Inductor sees matching strides.
- Test: update contiguity assertion (HND output is now non-contig view). `max_diff == 0` exact-match still holds.
Signed-off-by: Yiyun Lu <[email protected]>
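The stride argument in this commit can be checked with plain integer arithmetic. The sketch below models shapes and strides as tuples, with no tensors involved. The helper names and toy sizes are illustrative assumptions, and the contiguity test is simplified: it ignores size-1 dimensions.

```python
def contiguous_strides(shape):
    """Row-major strides (in elements) for a densely packed tensor."""
    strides, acc = [], 1
    for dim in reversed(shape):
        strides.append(acc)
        acc *= dim
    return tuple(reversed(strides))


def transpose(shape, strides, a, b):
    """A transpose view swaps two dims of shape and strides; no data moves."""
    shape, strides = list(shape), list(strides)
    shape[a], shape[b] = shape[b], shape[a]
    strides[a], strides[b] = strides[b], strides[a]
    return tuple(shape), tuple(strides)


def is_contiguous(shape, strides):
    return strides == contiguous_strides(shape)


B, S, H, D = 2, 6, 4, 8  # toy sizes; S plays the role of P*Sp

# New behaviour: NHD-contiguous storage, handed to HND callers as a view.
nhd_shape = (B, S, H, D)
nhd_strides = contiguous_strides(nhd_shape)
hnd_view = transpose(nhd_shape, nhd_strides, 1, 2)  # HND shape, NHD strides
back_from_view = transpose(*hnd_view, 1, 2)         # what SDPA output looks like after transpose(1, 2)

# Old behaviour: HND-contiguous storage.
hnd_shape = (B, H, S, D)
hnd_contig = (hnd_shape, contiguous_strides(hnd_shape))
back_from_contig = transpose(*hnd_contig, 1, 2)

free_with_view = is_contiguous(*back_from_view)      # contiguous() would be a no-op
copy_with_contig = not is_contiguous(*back_from_contig)  # contiguous() must copy
```

With the view, the later transpose lands back on the original dense NHD layout, so `contiguous()` has nothing to do. With HND-contiguous storage, the same transpose yields a strided layout that needs a real copy.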
…self-attn
Under async ulysses, video self-attn forces ``QKVMode.SEPARATE_QKV`` so
the three projections (to_q/to_k/to_v) can stream-pipeline through the
A2A. With NVFP4 static quant this triggers three identical
``tunable_fp4_quantize`` launches on the same hidden_states (verified
bit-equal input_scale across all WAN-2.2 high-noise 36 / low-noise 36
and LTX-2 single-stage 42 self-attn layers).
Pre-quantize the shared input once and pass the resulting
``Fp4QuantizedTensor`` to each Linear via the existing fp4-shortcut
path in NVFP4LinearMethod._input_prepare. Two of three quantize launches
per self-attn block per layer are eliminated (verified -368 launches in
the WAN A14B 8-GPU step capture; identical kernel count between
async-OFF FUSE_QKV and async-ON SEPARATE_QKV in LTX-2 prod-shape).
Gating:
- Structural at __init__: SEPARATE_QKV, NVFP4 layer mode, no
force_dynamic_quantization. Stored on ``_maybe_share_qkv_quantize``.
- Runtime on the forward path: ``getattr(to_q, "input_scale", None) is
not None`` -- guards against checkpoints that exclude individual
Linears from NVFP4 (e.g. LTX-2 transformer_blocks.10.attn1, where
to_q/to_k/to_v are bf16-only and never load input_scale).
Applied in the two async-self-attn paths:
- ``Attention.forward_async`` (base async self-attn path used by WAN).
- ``LTX2Attention.forward_async`` (LTX-2 has its own override; mirrors
the base path's pre-quantize + share contract).
Signed-off-by: Yiyun Lu <[email protected]>
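The quantize-once-and-share idea in this commit can be sketched with toy scalar "Linears". Everything here is a hypothetical stand-in: `QuantizedTensor`, `quantize`, `Linear`, and `project_qkv` are not the real `Fp4QuantizedTensor`, `tunable_fp4_quantize`, or `NVFP4LinearMethod`. Only the runtime gate `getattr(to_q, "input_scale", None) is not None` is taken from the commit text.

```python
class QuantizedTensor:
    """Hypothetical stand-in for a pre-quantized activation."""

    def __init__(self, values, scale):
        self.values = values
        self.scale = scale


quantize_calls = 0  # counts quantize launches


def quantize(x, scale):
    global quantize_calls
    quantize_calls += 1
    return QuantizedTensor([round(v / scale) for v in x], scale)


class Linear:
    """Toy scalar-weight layer with an optional static input scale."""

    def __init__(self, weight, input_scale=None):
        self.weight = weight
        self.input_scale = input_scale

    def __call__(self, x):
        if isinstance(x, QuantizedTensor):
            q = x                                # shortcut: already quantized
        elif self.input_scale is not None:
            q = quantize(x, self.input_scale)    # per-Linear quantize
        else:
            return [self.weight * v for v in x]  # unquantized layer
        return [self.weight * v * q.scale for v in q.values]


def project_qkv(x, to_q, to_k, to_v, share):
    # Runtime gate: share only when the Linear actually carries a static scale.
    if share and getattr(to_q, "input_scale", None) is not None:
        x = quantize(x, to_q.input_scale)
    return to_q(x), to_k(x), to_v(x)


hidden = [0.5, 1.0, -1.5]
to_q, to_k, to_v = Linear(2.0, 0.5), Linear(3.0, 0.5), Linear(4.0, 0.5)

quantize_calls = 0
separate = project_qkv(hidden, to_q, to_k, to_v, share=False)
calls_separate = quantize_calls

quantize_calls = 0
shared = project_qkv(hidden, to_q, to_k, to_v, share=True)
calls_shared = quantize_calls
```

In this toy model the shared path issues one quantize instead of three and produces the same projections, which mirrors the two-of-three saving described above.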
…oin_async
The original ``ulysses_a2a_async`` op did the CE peer push and the
symm-mem barrier back-to-back on the side stream, so V/Q/K each landed
``[push, barrier, push, barrier, push, barrier]`` on the comm stream.
Each barrier serialized with the next push, even though no cross-rank
ordering is required between issues -- only the final wait before SDPA
reads the recv buffer needs the fence.
Split the op into ``ulysses_a2a_async_push`` (CE push only) and
``ulysses_a2a_async_barrier`` (emit barrier on channel 0), keep the
original ``ulysses_a2a_async`` as a thin push+barrier alias for
backward compatibility, and rewire ``UlyssesAttention``:
- ``_issue_async`` calls ``_push`` and bumps ``_pending_barriers``.
- ``_join_async`` drains exactly ``_pending_barriers`` barriers on the
side stream and then records the tail event for the default-stream
wait.
Comm-stream FIFO preserves [push V, push Q, push K, barrier, barrier,
barrier]; channel-0 barriers all have identical semantics, so the
default-stream wait sees a fully-synced recv buffer. WAN 8-GPU step
capture shows the per-call barrier_kernel time drops from ~164us to
~93us (-43%) and consecutive V/Q/K pushes run through CE without the
intervening fence kernel cutting them apart (PTOP throughput -4ms /
step). End-to-end wall is unchanged in WAN because barrier and push
already sat on the side stream; the win is on energy / GPU work and
opens room for follow-up overlap tuning.
Signed-off-by: Yiyun Lu <[email protected]>
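The ordering change in this commit can be modelled as a FIFO of issued operations. The sketch below is a hypothetical host-side model only: `CommLane` and its methods are illustrative names, and no CUDA streams or barriers are involved.

```python
class CommLane:
    """Hypothetical model of the side stream as a FIFO of issued ops."""

    def __init__(self):
        self.fifo = []
        self.pending_barriers = 0

    def issue_fused(self, name):
        # Original behaviour: every push is immediately followed by a barrier.
        self.fifo.append(f"push {name}")
        self.fifo.append("barrier")

    def issue_deferred(self, name):
        # New behaviour: push now, remember that one barrier is owed.
        self.fifo.append(f"push {name}")
        self.pending_barriers += 1

    def join(self):
        # Drain exactly the owed barriers, then record the tail event that
        # the default stream waits on before SDPA.
        self.fifo.extend(["barrier"] * self.pending_barriers)
        self.pending_barriers = 0
        self.fifo.append("record tail event")


fused = CommLane()
deferred = CommLane()
for name in ("V", "Q", "K"):
    fused.issue_fused(name)
    deferred.issue_deferred(name)
fused.join()
deferred.join()
```

In the deferred lane the three pushes sit back-to-back in the FIFO, and the barriers only appear once `join` runs.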
…Kernel
The original SPDX-only headers (`SPDX-FileCopyrightText` + `SPDX-License-Identifier`) did not parse cleanly under the `license_checker` used by the C++ CI stage, which expects the full Apache-2.0 boilerplate that the rest of `cpp/tensorrt_llm/kernels/` already uses. Switch the two ulyssesPermuteScatterKernel files to that format; no code change.
Signed-off-by: Yiyun Lu <[email protected]>
… rename async_pipeline kwarg to async_ulysses for consistency
The user-facing ParallelConfig flag, Attention.__init__, LTX2Attention, and WanBlock all use async_ulysses. UlyssesAttention's constructor and wrap_parallel_attention used async_pipeline (introduced by the Option B refactor in the symm-mem CUDA-IPC commit). The split was theoretically motivated (wrapper-internal vs user-feature) but in practice every caller mapped 1:1, and the divergent name hurt grep + reviewer comprehension. This renames the 5 occurrences for consistency.
Signed-off-by: Yiyun Lu <[email protected]>
a861c81 to 0b03cb6
/bot run --disable-fail-fast
PR_Github #52515 [ run ] triggered by Bot. Commit:
PR_Github #52513 [ run ] completed with state
PR_Github #52515 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #52540 [ run ] triggered by Bot. Commit:
PR_Github #52540 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #52576 [ run ] triggered by Bot. Commit:
PR_Github #52576 [ run ] completed with state
@coderabbitai summary
Description
Adds an opt-in async Ulysses pipeline for sequence parallelism on diffusion video models (LTX-2 + WAN 2.2). The pipeline overlaps per-rank V/Q/K compute (`GEMM → RMSNorm → RoPE`) with cross-rank data movement (PyTorch `_SymmetricMemory` CUDA-IPC peer access + symm-mem `barrier` release fence) on a single dedicated per-device side stream. SMs stay free for the next V/Q/K while the GPU Copy Engine drains the prior V/Q/K's peer push.
Opt-in gate: `parallel_config.async_ulysses` (default False). Both LTX-2 (`LTX2Attention.__init__(async_ulysses=...)`) and WAN 2.2 self-attn read the same gate. Non-Ulysses, audio, cross-attn, and `ulysses_size == 1` paths are unchanged.
Requires PyTorch ≥ 2.5 for `torch._C._distributed_c10d._SymmetricMemory.empty_strided_p2p` / `rendezvous` / `barrier`. No NCCL device-API dependency.
Enabling
YAML config (`trtllm-serve --config foo.yaml`):
Python API (`VisualGenArgs`):
A/B comparison against the baseline blocking-A2A path needs zero code change; flip the bool.
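As a rough illustration of the opt-in and the A/B toggle, the sketch below builds two configurations that differ only in the gate. The key names `parallel_config`, `cfg_size`, `ulysses_size`, and `async_ulysses` appear in this PR, but the dict layout and the `make_parallel_config` helper are assumptions for illustration, not the actual `VisualGenArgs` API.

```python
import copy


def make_parallel_config(ulysses_size, cfg_size=1, async_ulysses=False):
    """Hypothetical helper: a dict shaped like a YAML `parallel_config` block."""
    return {
        "parallel_config": {
            "cfg_size": cfg_size,
            "ulysses_size": ulysses_size,
            "async_ulysses": async_ulysses,  # opt-in gate; default False
        }
    }


# Baseline: blocking all-to-all path.
baseline = make_parallel_config(ulysses_size=2, cfg_size=2)

# Candidate: identical except for the single boolean.
candidate = copy.deepcopy(baseline)
candidate["parallel_config"]["async_ulysses"] = True

# Keys that differ between the two runs.
changed = sorted(
    key
    for key in baseline["parallel_config"]
    if baseline["parallel_config"][key] != candidate["parallel_config"][key]
)
print(changed)
```

Keeping every other field fixed means a latency difference between the two runs can be attributed to the async path alone.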
Design
Stream lanes per attention call
Issue order V → Q → K. V's CE push overlaps with Q's compute; Q's push overlaps with K's compute. Three barriers fire only at `_join_async`, after which the default stream waits on the tail event before SDPA. The side stream is a per-device singleton (`UlyssesAttention._side_stream_by_device`), so all transformer layers on the same device share one comm lane, which avoids per-layer stream proliferation under cuda_graph capture / inductor.
C++ ops surface
`AsyncUlyssesOp` caches a canonical `SymmetricMemory` handle on the first slot allocation (`mCanonicalHandle`); `emitBarrier()` uses it directly instead of scanning slots each call.
`forward_async` API (closure-based)
`@torch.compiler.disable(recursive=False)` on `_issue_async` / `_join_async` is the only stream-switch boundary. The caller's `compute_q/k/v` closures execute on the default stream between successive `_issue_async` graph breaks, so their bodies sit inside the outer block's inductor graph and get full `GEMM+norm+RoPE` fusion (verified: `nvjet_sm100_ootst_..._Avec16UE4M3_Bvec16UE4M3_TNT` kernels emitted under NVFP4).
`_SymmetricMemory` slot ring
`AsyncUlyssesOp` owns a 3-slot ring (V/Q/K per layer; the intra-layer slot-reuse hazard requires ≥ 3, and cross-layer reuse is safe because `_join_async` syncs the side stream before the next layer's V starts). The first call at `(numel, dtype)` triggers `empty_strided_p2p + rendezvous`; subsequent calls reuse the cached `(handle, peer_ptrs)`. Ops are cached by `pg->getGroupName()` so multiple PGs (e.g. ulysses subgroup × cfg subgroup) coexist cleanly. Allocation is not graph-capture-safe (rendezvous / cudaMalloc inside), so `forward_async` requires one out-of-capture warmup pass before cuda_graph capture starts.
Key Optimizations
The four mechanisms below are what make this PR's async path actually faster than the baseline blocking A2A. Listed bottom-up from hardware-level resource choice to host-side launch reduction.
1. Copy-Engine peer push (avoid SM contention)
V/Q/K peer-to-peer data movement runs entirely on the GPU Copy Engines, never on SM-resident NCCL kernels:
- Eager path: `cudaMemcpyBatchAsync` over the `P-1` per-peer chunks (driver fans out across CEs).
- Capture path: per-peer `cudaMemcpyAsync` loop (because `cudaMemcpyBatchAsync` isn't graph-capture-safe until CUDA 13.5; see Follow-ups #1).
The Copy Engine handles all `P-1` peer pushes, leaving SMs fully free for the next V/Q/K `GEMM → RMSNorm → RoPE`. Compared to NCCL all-to-all (Send/Recv kernels on SMs competing for the same compute resource as the GEMMs), the CE-based push uses hardware resources disjoint from compute, so no channel multiplexing or ring-topology hacks are needed.
2. Fuse permute with self-chunk scatter (bypass CE local-D2D bandwidth cap)
`ulyssesPermuteScatterKernel` does two things in one launch:
- Permutes `[B, S, H, D] → [P, B, S, H/P, D]`.
- Writes the self chunk directly into the recv slot (`slot.basePtr + my_rank * chunk_bytes`); only the `P-1` peer chunks go to `slot.sendBuf`.
Without the fuse, the alternative is: SM permute → sendBuf; then CE runs `P` memcpys (`P-1` peer pushes + 1 local sendBuf → recv self-slot D2D copy). The CE's local D2D channel is hardware-capped at ~100 GB/s (independent of the NVLink/NVSwitch peer channel at ~400+ GB/s; they don't share bandwidth, but local D2D's own ceiling is low). Since the self chunk is the same size as one peer chunk, the local copy would become the critical path of the entire a2a: the `P-1` peer pushes finish 4× faster while we wait for one local D2D.
Fusing moves the self-chunk write into the SM kernel, using HBM bandwidth (~5–10 TB/s), one to two orders of magnitude above CE local-D2D. The alternative of green-context isolation for the local copy (separate CE queue) requires complex multi-stream/queue coordination; fusing bypasses that whole subsystem.
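The index mapping behind the fused permute and self-chunk scatter can be sketched on flat Python lists. This is a host-side reference model, not the CUDA kernel. It assumes that peer `p` receives heads `[p*H/P, (p+1)*H/P)` and that the send side can be modelled as one list per peer. `permute_scatter` and its return layout are illustrative.

```python
def permute_scatter(x, B, S, H, D, P, my_rank):
    """Reference model: x is a flat list in [B, S, H, D] order.

    Returns (recv_self, send), where recv_self is this rank's own
    [B, S, H/P, D] chunk and send maps each peer to its chunk.
    """
    Hp = H // P                      # heads per rank
    chunk = B * S * Hp * D           # elements per [B, S, H/P, D] chunk
    chunks = [[None] * chunk for _ in range(P)]
    for b in range(B):
        for s in range(S):
            for h in range(H):
                p, h_local = divmod(h, Hp)   # destination rank, local head index
                for d in range(D):
                    src = ((b * S + s) * H + h) * D + d
                    dst = ((b * S + s) * Hp + h_local) * D + d
                    chunks[p][dst] = x[src]
    # The self chunk goes straight to the receive side; only peers are "sent".
    recv_self = chunks[my_rank]
    send = {p: c for p, c in enumerate(chunks) if p != my_rank}
    return recv_self, send


# Toy sizes: 1 batch, 2 tokens, 4 heads, head dim 2, 2 ranks.
B, S, H, D, P = 1, 2, 4, 2, 2
x = list(range(B * S * H * D))
recv_self, send = permute_scatter(x, B, S, H, D, P, my_rank=0)
```

For rank 0 in this toy case, heads 0 and 1 of both tokens stay local, while heads 2 and 3 form the single chunk destined for rank 1.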
3. Deferred barriers (push×N FIFO, barrier×N drain at join)
V/Q/K's a2a is split into two C++ ops, `ulysses_a2a_async_push` (CE push only) and `ulysses_a2a_async_barrier` (symm-mem barrier release fence). Side-stream sequence: `[push V, push Q, push K, barrier, barrier, barrier]`.
Originally the op was push+barrier fused, producing `[push, barrier, push, barrier, push, barrier]`, where each barrier kernel cut the CE FIFO and forced the next push to wait. Each barrier kernel drops from ~164 µs → ~93 µs (-43%) under the deferred scheme, and per-step CE PTOP throughput improves by ~4 ms (WAN 8-GPU step capture). Python state tracks `_pending_barriers` on `UlyssesAttention`; `_join_async` drains exactly N barriers on the side stream before the default stream waits on the tail event.
4. Dedup `fp4_quantize` across QKV self-attn
Under async ulysses the video self-attn is forced to `SEPARATE_QKV` (so V/Q/K can stream-pipeline independently), but with NVFP4 static quant the three Linears all share `input_scale`.
Pre-quantize hidden_states once and share the resulting `Fp4QuantizedTensor` across all three Linears via the existing fp4-shortcut path in `NVFP4LinearMethod._input_prepare` (which just unpacks `.fp4_tensor` / `.scaling_factor` for GEMM; no internal swizzle, so the SF layout is the same as Linear's own quantize path). Eliminates 2 of 3 quantize launches per self-attn block (verified −368 launches in WAN A14B 8-GPU step capture; identical kernel count between async-OFF FUSE_QKV and async-ON SEPARATE_QKV+dedup).
Gating:
- Structural (`_maybe_share_qkv_quantize`, set in `__init__`): `SEPARATE_QKV` + NVFP4 + not `force_dynamic_quantization`.
- Runtime: `getattr(to_q, "input_scale", None) is not None`, which guards against checkpoints that exclude individual Linears from NVFP4 (e.g. LTX-2 `transformer_blocks.10.attn1`, bf16-only, which falls back to per-Linear quantize).
Benchmarks
Isolated all-to-all (B200, single rank)
`bench_alltoall_ab.py`: single-tensor a2a latency (ms), B=2, S_total=6144, H=32, D=128, warmup=30, bench=100. `naive_fused` = `dist.all_to_all_single` on 5D fused QKV; `naive_split` = 3 separate 4D a2a; `symm_ce` = the new symm-mem CE pipeline (per V/Q/K call).
`symm_ce` is 1.16–1.29× faster than the fused baseline in every cell except `P=8 graph` (where cuda_graph capture amortizes Python launch overhead in the `naive_fused` 1-collective path enough that the merged baseline wins).
End-to-end on B200 (production scale)
[Results table not recoverable; surviving row labels: (`torch.compile` + `cuda_graph`) and (`torch.compile` only)]
3 timed E2E runs/cell, median reported. Per-cell range is well below the OFF→ON delta in every case (non-overlap). WAN's larger relative gain reflects more compute-per-A2A and a longer overlap window per layer (no cuda_graph overhead to amortize the baseline path).
Files
- `cpp/tensorrt_llm/kernels/ulyssesPermuteScatterKernel.{h,cu}` (new)
- `cpp/tensorrt_llm/kernels/ulyssesPostUnscatterKernel.{h,cu}` (new)
- `cpp/tensorrt_llm/thop/asyncUlyssesOp.cpp` (new): `SendHandle` + `AsyncUlyssesOp` (symm-mem slot ring + `mCanonicalHandle` cache + CE push + barrier). Ops: `ulysses_a2a_async_prepare`, `ulysses_a2a_async_push`, `ulysses_a2a_async_barrier`.
- `cpp/tensorrt_llm/thop/ulyssesPermuteScatterOp.cpp` (new)
- `cpp/tensorrt_llm/thop/ulyssesPostUnscatterOp.cpp` (new)
- `cpp/tensorrt_llm/thop/CMakeLists.txt`
- `tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py`: `register_fake` for `trtllm::ulysses_post_unscatter_qkv`.
- `tensorrt_llm/_torch/visual_gen/config.py`: `async_ulysses` Pydantic field (default False).
- `tensorrt_llm/_torch/visual_gen/attention_backend/parallel.py`: `UlyssesAttention(async_pipeline=...)`, side-stream singleton, `_issue_async` / `_join_async` deferred-barrier pipeline, V/Q/K rolling `forward_async`.
- `tensorrt_llm/_torch/visual_gen/modules/attention.py`: `async_ulysses` flag; QKV `fp4_quantize` dedup in `Attention.forward_async` (shared `Fp4QuantizedTensor` across to_q/to_k/to_v).
- `tensorrt_llm/_torch/visual_gen/models/ltx2/transformer_ltx2.py`: `LTX2Attention.forward_async` override.
- `tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/utils_ltx2.py`
- `tensorrt_llm/_torch/visual_gen/models/wan/transformer_wan.py`: self-attn (`attn1`) reads the gate; SEPARATE_QKV when on.
- `tests/unittest/_torch/thop/parallel_hw_agnostic/test_ulysses_permute_scatter.py` (new)
- `tests/unittest/_torch/thop/parallel_hw_agnostic/test_ulysses_post_unscatter.py` (new)
- `tests/unittest/_torch/visual_gen/multi_gpu/test_ulysses_async.py` (new)
Follow-ups
1. `cudaMemcpyBatchAsync` under CUDA Graph capture (CUDA 13.5). Today the capture branch falls back to a `P-1`-iteration per-peer `cudaMemcpyAsync` loop because `cudaMemcpyBatchAsync` is not graph-capture-safe. Per NVBugs 5853376 and NVBugs 5760690, support is targeted at CUDA 13.5 (Q4 2026). We tried `cudaGraphAddMemcpyNode1D` with sibling-DAG independence as a stop-gap; nsys confirmed the runtime serializes the sibling memcpys on the same HW queue (~4 µs gaps, no CE fanout), so the per-peer loop wins today.
2. Collapse N barriers at join → 1 barrier. Current `_join_async` drains N symm-mem barriers in a loop (one per deferred push), kept 1:1 with the original protocol for a risk-free refactor. The barriers are semantically identical (all channel-0 PG-level fences); a single barrier after `[push_V, push_Q, push_K]` should sync the recv buffers just as well (side-stream FIFO + cross-rank barrier covers all prior pushes). Saves 2 of 3 `barrier_kernel` per layer (~186 µs × N_layers per step on B200). Requires verifying that all ranks symmetrically issue 1 barrier per join (slot rotation is independent of barrier count: `nextSlotIdx` ticks in `_prepare`, not `_barrier`).
3. Drop `@torch.compiler.disable`, native multi-stream `torch.compile`. Stream-switch boundaries in `_issue_async` / `_join_async` are guarded by `@torch.compiler.disable(recursive=False)` because inductor doesn't yet model `with cuda.stream(...)` + `event.record/wait_event` as first-class graph ops (4 graph breaks per layer × N layers). Inline once `torch.compile`'s multi-stream support matures.
4. Fuse pre-A2A permute into the RMSNorm+RoPE kernel. We already own the fused RMSNorm+RoPE kernel (see `fuse_qk_norm_rope=True` path). Passing the slot's `send_buf` pointer + permute index map directly into the kernel epilogue lets it write the permuted layout straight into the symm-mem slot, eliminating the separate `ulyssesPermuteScatterKernel` launch on the critical path. Leaves the post-unscatter kernel untouched.
Test Coverage
- `tests/unittest/_torch/thop/parallel_hw_agnostic/`: `test_ulysses_permute_scatter.py` and `test_ulysses_post_unscatter.py` validate the two new fused CUDA kernels against torch references in isolation.
- `tests/unittest/_torch/visual_gen/multi_gpu/test_ulysses_async.py`: three `torch.multiprocessing.spawn` tests covering the production `prepare/push/barrier` path end-to-end:
  - `test_slot_ring_wraparound`: eager loop > `kNumSlots*2` iterations, byte-exact vs `all_to_all_4d`. Catches off-by-one slot-reuse bugs.
  - `test_capture_smoke`: warm slot out-of-capture, then `torch.cuda.CUDAGraph` replay 8× with fresh inputs, byte-exact vs `all_to_all_4d`. Exercises the per-peer `cudaMemcpyAsync` path used under production cuda_graph.
  - `test_multi_pg`: two distinct PGs spanning the same ranks, alternate calls, byte-exact vs each PG's `all_to_all_4d`. Exercises PG-name caching in `getOrCreateOp` and `set_group_info` re-registration across groups.
- `DGX_B200-4_GPUs-PyTorch-1`, `DGX_B200-8_GPUs-PyTorch-1` (covers Ulysses ws=4 and ws=8 in both LTX-2 and WAN end-to-end).
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.