DGX Spark Connection Failed alert

I am currently attempting to configure a cluster by connecting two DGX Spark systems with a QSFP 200Gb cable. I followed the tutorial in the DGX Spark Resource for the connection and proceeded with the NCCL configuration.

When running tests in the terminal, I confirmed that the output appears as intended, but I’m concerned about the persistent “connection failed” alerts that keep occurring.

To identify the problem, I clicked on the alert, but it simply disappears without showing a cause, so I can't determine what is wrong.

I’m wondering whether the cable is the problem, the DGX Spark systems are the issue, or if there’s a problem in the setup process.

Is there a way to check each component individually to verify if there are any issues?

The "connection failed" warning will show up even if your Sparks are successfully connected. You can ignore it for now, and we will look into removing this warning in a future update.
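
If you want to confirm the link itself independently of the alert, one option is to read what the kernel reports for each interface under `/sys/class/net`. The snippet below is only a rough sketch: the interface name `enp1s0f0np0` is a placeholder I made up (check `ip link` for the real names on your Sparks), and the helper `read_link_status` is hypothetical, not part of any NVIDIA tooling.

```python
from pathlib import Path


def read_link_status(ifname, sysfs_root="/sys/class/net"):
    """Return the kernel's view of one network interface as a dict.

    Values are None when the sysfs file is missing or unreadable
    (for example, 'speed' cannot be read while the link is down).
    """
    base = Path(sysfs_root) / ifname

    def read(name):
        try:
            return (base / name).read_text().strip()
        except OSError:
            return None

    return {
        "operstate": read("operstate"),  # e.g. "up" or "down"
        "carrier": read("carrier"),      # "1" when a physical link is detected
        "speed_mbps": read("speed"),     # negotiated speed in Mb/s
    }


if __name__ == "__main__":
    # Placeholder interface name (assumption); replace with your CX7 interface.
    print(read_link_status("enp1s0f0np0"))
```

Running it on both Sparks gives a per-machine check: if `carrier` is `1` and `operstate` is `up` on each side, the cable and ports are at least linking, and anything left is configuration.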


Try to go through these threads:

They contain very useful information and discussions for understanding how the platform actually works, plus some recipes to stack the Sparks properly, configure NCCL, validate the setup, and run distributed inference.

This message actually comes from Ubuntu's NetworkManager, which does not see the IPs that you assign to the CX7 interfaces. To get rid of the message, assign those IPs through NetworkManager itself, either in the GUI or with nmcli if you prefer the command line.
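
For the nmcli route, here is a minimal sketch that only builds and prints the commands so you can review them before running anything. The interface name, the address `192.168.100.10/24`, and the `cx7-` profile naming are all assumptions for illustration; use the IPs you already assigned during the NCCL setup. It uses `nmcli connection add`, which creates a new profile; if NetworkManager already has a profile for that interface, `nmcli connection modify` on the existing one is probably the better fit.

```python
import ipaddress
import shlex


def nmcli_static_ip_commands(ifname, cidr, con_name=None):
    """Build nmcli commands that give `ifname` a static IPv4 address.

    `cidr` must look like "192.168.100.10/24"; ValueError is raised otherwise.
    The default profile name "cx7-<ifname>" is a made-up convention.
    """
    ipaddress.ip_interface(cidr)  # validates the address/prefix
    con_name = con_name or f"cx7-{ifname}"
    add = [
        "nmcli", "connection", "add",
        "type", "ethernet",
        "ifname", ifname,
        "con-name", con_name,
        "ipv4.method", "manual",
        "ipv4.addresses", cidr,
    ]
    up = ["nmcli", "connection", "up", con_name]
    return [add, up]


if __name__ == "__main__":
    # Placeholder interface and address (assumptions); nothing is executed.
    for cmd in nmcli_static_ip_commands("enp1s0f0np0", "192.168.100.10/24"):
        print(shlex.join(cmd))
```

Printing instead of executing is deliberate: you can paste the output into a terminal on each Spark (with a different address per machine) once the names match your setup.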