Try to go through these threads:
The default for the variable was set to true in start_cluster in all cases, so running a non-MoE model triggered the issue. It is now synced across the scripts: the variable is set to true for MoE models and false otherwise.
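A minimal sketch of the kind of fix described above, assuming the flag is exported as an environment variable and MoE-ness can be read from the model config; the names (ENABLE_MOE_MODE, is_moe_model, num_experts) are illustrative, not taken from the actual start_cluster scripts.

```python
# Hypothetical sketch: derive the flag from the model instead of hard-coding it to true.
import os


def is_moe_model(model_config: dict) -> bool:
    """Treat any model that declares more than one expert as MoE (assumed convention)."""
    return model_config.get("num_experts", 0) > 1


def export_cluster_flags(model_config: dict) -> None:
    """Set the flag consistently for every launcher script."""
    os.environ["ENABLE_MOE_MODE"] = "true" if is_moe_model(model_config) else "false"


export_cluster_flags({"num_experts": 8})      # MoE model   -> "true"
export_cluster_flags({"hidden_size": 4096})   # dense model -> "false"
```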
Not really anything new here; we all figured it out a few posts back. IOW, there is still no way to aggregate the two ports into a single 200G one: the software itself needs to be able to span both logical interfaces. NCCL does that, but most TCP/IP applications (e.g. iperf3) can only run multiple streams across a single interface, not two, so for practical purposes (other than NCCL workloads) we can't easily transfer data between two devices at speeds >100G.
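To make the distinction concrete, here is a small sketch of how NCCL can be pointed at both logical interfaces; NCCL_IB_HCA and NCCL_SOCKET_IFNAME are standard NCCL environment variables, but the device and interface names (mlx5_0, mlx5_1, enp1s0f0np0, enp1s0f1np1) are assumptions to be checked against `ibv_devinfo` and `ip link` on the Sparks.

```python
# Sketch: tell NCCL it may stripe collectives across both RDMA devices / interfaces.
import os

os.environ["NCCL_IB_HCA"] = "mlx5_0,mlx5_1"                    # both ConnectX devices (assumed names)
os.environ["NCCL_SOCKET_IFNAME"] = "enp1s0f0np0,enp1s0f1np1"   # both logical interfaces (assumed names)

# A torch.distributed / NCCL job launched after this (e.g. via torchrun across the
# two nodes) can use both links. By contrast, `iperf3 -P 8` opens eight streams,
# but they all ride the single interface the connection was established on,
# which is why plain TCP tests top out around the single-port speed.
```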
**Dual 200G QSFP Setup Achieving Only 208 Gbps - GPU Direct RDMA Issues on ARM64**

TL;DR: It appears we can achieve 400G aggregate by connecting the two DGX Sparks with two cables. Can we enable GPU Direct RDMA to potentially achieve closer to 400 Gbps (50 GB/s) aggregate bandwidth in the next firmware update?
This is as close as I got:
[Screenshot from 2025-11-20 01-36-37]
Environment Details:
Hardware:
2x NVIDIA DGX Spark (Blackwell GB10 GPUs)
2x 200G QSFP56 passive DAC cables (MC…
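As a hedged diagnostic for the GPUDirect RDMA question in that thread, the sketch below only checks the usual prerequisites (the nvidia_peermem kernel module being loaded and RDMA devices being visible via rdma-core's ibv_devinfo); whether GPUDirect RDMA actually works on this ARM64 firmware is exactly the open question, so this is a check, not a fix.

```python
# Check common GPUDirect RDMA prerequisites on a node.
import pathlib
import shutil
import subprocess


def peermem_loaded() -> bool:
    """nvidia_peermem is the kernel module that exposes GPU memory to the RDMA stack."""
    return pathlib.Path("/sys/module/nvidia_peermem").exists()


def list_rdma_devices() -> str:
    """List ConnectX devices and link state, if rdma-core tools are installed."""
    if shutil.which("ibv_devinfo") is None:
        return "ibv_devinfo not found (install rdma-core)"
    return subprocess.run(["ibv_devinfo"], capture_output=True, text=True).stdout


if __name__ == "__main__":
    print("nvidia_peermem loaded:", peermem_loaded())
    print(list_rdma_devices())
```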
These threads contain very useful information and discussion for understanding how the platform actually works, plus recipes for stacking the Sparks properly, configuring NCCL, validating the setup, and running distributed inference.