This document explains how to manage cluster health and perform maintenance tasks in Cluster Director.
In addition to running automated health checks before starting jobs and replacing unhealthy nodes, Cluster Director provides several tools to help you proactively manage cluster resilience and help ensure optimal operation for your clusters. Specifically, you can do one or more of the following:
Manually start planned maintenance events.
Report and repair unhealthy nodes.
Verify cluster health.
Bypass GPU health checks for a specific job.
Before you begin
Select the tab for how you plan to use the samples on this page:
Console
When you use the Google Cloud console to access Google Cloud services and APIs, you don't need to set up authentication.
gcloud
In the Google Cloud console, activate Cloud Shell.
At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
Required roles
To get the permissions that you need to view and manage cluster health, ask your administrator to grant you the Hypercompute Cluster Editor (roles/hypercomputecluster.editor) IAM role on the project.
For more information about granting roles, see Manage access to projects, folders, and organizations.
This predefined role contains the permissions required to view and manage cluster health. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to view and manage cluster health:
- To view the details of a single cluster: hypercomputecluster.clusters.describe
You might also be able to get these permissions with custom roles or other predefined roles.
Perform cluster management tasks
The following sections describe how you can manage the health of your cluster.
Manually start planned maintenance events
If a maintenance event is planned for your cluster, then you can immediately start maintenance rather than waiting for the planned time.
To manually start a planned maintenance event for your cluster by using the Google Cloud console, complete the following steps:
In the Google Cloud console, go to the Cluster Director page.
In the navigation menu, click Clusters. The Clusters page appears.
In the Clusters table, in the Name column, click the name of the cluster whose details you want to view. A page with the details of the cluster appears, and the Details tab is selected.
Click the Topology tab. Then, if a maintenance event is scheduled for your cluster, click Start maintenance now.
Report and repair unhealthy nodes
If a node is unhealthy due to a persistent host error, then you can report the node to start a repair operation. This action helps you minimize workload disruption and restore the cluster to an optimal state. If you have reserved capacity available or if capacity is available in the cluster's region, then Cluster Director replaces the node with a healthy one.
To report and repair an unhealthy node in your cluster by using the Google Cloud console, complete the following steps:
In the Google Cloud console, go to the Cluster Director page.
In the navigation menu, click Clusters. The Clusters page appears.
In the Clusters table, in the Name column, click the name of the cluster whose details you want to view. A page with the details of the cluster appears, and the Details tab is selected.
Click the Topology tab. Then, if one or more nodes are unhealthy, click Report all unhealthy nodes.
Verify cluster health
Before running a job on a compute node, Slurm automatically runs a quick GPU health check on the node. If the node fails the check, then Slurm drains the node and prevents scheduling new jobs on it.
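For example, to see which compute nodes Slurm has drained and why, you can run standard Slurm commands from a login node. The node name in the last command is a placeholder; substitute one of your own nodes:

# List nodes that are in the drained state, grouped by partition.
sinfo --states=drain

# Show the reason that each unavailable node was drained or downed.
sinfo -R

# Show the full state of a specific node (example node name shown).
scontrol show node a4x-nodeset-0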
To more thoroughly test GPU health and network bandwidth across the compute nodes in a cluster partition, you can manually run NVIDIA Collective Communications Library (NCCL) tests. If an NCCL test identifies any unhealthy nodes, then you can repair the nodes or modify your cluster. Running NCCL tests helps you verify a partition's health and troubleshoot issues before you run critical workloads.
Based on the machine series that the compute nodes in a cluster partition use, select one of the following options:
A4X
If you haven't already, then connect to a login node in your cluster.
In the $HOME directory, download the following NCCL script:

wget https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/tags/v1.65.0/examples/machine-learning/a4-highgpu-8g/system_benchmarks/run-nccl-tests-via-ramble.sh

To view a list of the network interface names that the compute nodes in a cluster partition use, run the following command:

srun -N 1 --partition=PARTITION_NAME ip addr show

Replace PARTITION_NAME with the name of the cluster partition that you want to test. To review the partitions in your cluster, view the details of your cluster.

To help ensure that the network interface names and the name of the partition to test are correct, complete the following steps:
Open the run-nccl-tests-via-ramble.sh file with a text editor of your choice.

Compare the network interface names in the run-nccl-tests-via-ramble.sh file with the network interface names that you viewed in the output of the previous command. If they don't match, then, in the env_vars section within the run-nccl-tests-via-ramble.sh file, edit the OMPI_MCA_btl_tcp_if_include, UCX_NET_DEVICES, and NCCL_SOCKET_IFNAME variables as follows:

env_vars:
  set:
    OMPI_MCA_btl_tcp_if_include: TCP_INTERFACE_NAME
    PMIX_MCA_gds: ^ds12
    UCX_NET_DEVICES: UCX_NET_DEVICES_LIST
    PMIX_MCA_psec: native
    UCX_IB_FORK_INIT: n
    NCCL_NET: gIB
    NCCL_SOCKET_IFNAME: NCCL_SOCKET_IFNAMES
    LD_LIBRARY_PATH: /usr/local/gib/lib64:usr/local/nvidia/lib

Replace the following:

TCP_INTERFACE_NAME: the name of the Transmission Control Protocol (TCP) interface to include, for example enp0s1.
UCX_NET_DEVICES_LIST: a comma-separated list of Unified Communication X (UCX) network devices, for example gpu0rdma0,gpu1rdma0,gpu2rdma0,gpu3rdma0.
NCCL_SOCKET_IFNAMES: a comma-separated list of NCCL socket interface names, for example enp0s1,enp192s1.
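Optionally, after you save your edits, you can confirm that the file now references your interface names. The following check assumes that the file is in your current directory:

# Print the edited variables so that you can compare them with the
# output of "ip addr show" on the compute nodes.
grep -E 'OMPI_MCA_btl_tcp_if_include|UCX_NET_DEVICES|NCCL_SOCKET_IFNAME' run-nccl-tests-via-ramble.sh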
If the node that you want to test isn't in the default partition, then, at the end of the sbatch section, add the following line:

#SBATCH --partition=PARTITION_NAME

By default, the test runs the AllGather, AllReduce, and ReduceScatter benchmarks across an increasing number of nodes, up to the number of nodes that are available in the cluster partition. The test submits a separate job for each combination of benchmark and number of nodes to test. To edit the benchmarks to run or the number of nodes to test, in the applications section, edit the workload and n_nodes variables as follows:

applications:
  nccl-tests:
    workloads:
      '{workload}':
        experiments:
          '{workload}-{n_nodes}':
            variables:
              workload: [NCCL_BENCHMARK]
              n_nodes: [NUMBER_OF_NODES]
            matrix:
            - n_nodes
            - workload

Replace the following:

NCCL_BENCHMARK: a comma-separated list of NCCL benchmarks to run on the nodes, for example all-gather, all-reduce, reduce-scatter.
NUMBER_OF_NODES: a comma-separated list of the different numbers of nodes that the test runs against. Specify values between 1 and the number of nodes in your cluster partition, for example 2, 4, 8, 16, 32.
To run the NCCL test script, run the following command:

nohup bash ./run-nccl-tests-via-ramble.sh "$HOME" >& nccl-$(date -Iseconds).log & tail -f nccl-*.log

Running the NCCL test can take some time to complete. When it does complete, the output is similar to the following:

...
---- SUMMARY for >1GB Message Sizes ----
workload         n_nodes  msg_size     busbw
all-gather       2        1073741824   XXX.XX
all-gather       2        2147483648   XXX.XX
all-gather       2        4294967296   XXX.XX
all-gather       2        8589934592   XXX.XX
...
all-reduce       2        1073741824   XXX.XX
...
reduce-scatter   2        1073741824   XXX.XX
...
-------- Benchmarking Complete -------
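Because the command writes all test output to the timestamped log file, one optional way to review a finished run is to search that log. The patterns below are a sketch and assume the log naming from the previous step:

# Print the bandwidth summary tables from the run.
grep -A 20 "SUMMARY" nccl-*.log

# Check whether any experiment reported an error or failure.
grep -iE "error|fail" nccl-*.log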
A4
If you haven't already, then connect to a login node in your cluster.
In the $HOME directory, download the following NCCL script:

wget https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/tags/v1.65.0/examples/machine-learning/a4-highgpu-8g/system_benchmarks/run-nccl-tests-via-ramble.sh

To view a list of the network interface names that the compute nodes in a cluster partition use, run the following command:

srun -N 1 --partition=PARTITION_NAME ip addr show

Replace PARTITION_NAME with the name of the cluster partition that you want to test. To review the partitions in your cluster, view the details of your cluster.

To help ensure that the network interface names and the name of the partition to test are correct, complete the following steps:
Open the run-nccl-tests-via-ramble.sh file with a text editor of your choice.

Compare the network interface names in the run-nccl-tests-via-ramble.sh file with the network interface names that you viewed in the output of the previous command. If they don't match, then, in the env_vars section within the run-nccl-tests-via-ramble.sh file, edit the OMPI_MCA_btl_tcp_if_include and NCCL_SOCKET_IFNAME variables as follows:

env_vars:
  set:
    OMPI_MCA_btl_tcp_if_include: TCP_INTERFACE_NAME
    PMIX_MCA_gds: ^ds12
    NCCL_NET: gIB
    NCCL_SOCKET_IFNAME: NCCL_SOCKET_IFNAMES
    LD_LIBRARY_PATH: /usr/local/gib/lib64:usr/local/nvidia/lib

Replace the following:

TCP_INTERFACE_NAME: the name of the Transmission Control Protocol (TCP) interface to include, for example enp0s1.
NCCL_SOCKET_IFNAMES: a comma-separated list of NCCL socket interface names, for example enp0s1,enp192s1.
If the node that you want to test isn't in the default partition, then, at the end of the sbatch section, add the following line:

#SBATCH --partition=PARTITION_NAME

By default, the test runs the AllGather, AllReduce, and ReduceScatter benchmarks across an increasing number of nodes, up to the number of nodes in the cluster partition. The test submits a separate job for each combination of benchmark and number of nodes to test. To edit the benchmarks to run or the number of nodes to test, in the applications section, edit the workload and n_nodes variables as follows:

applications:
  nccl-tests:
    workloads:
      '{workload}':
        experiments:
          '{workload}-{n_nodes}':
            variables:
              workload: [NCCL_BENCHMARK]
              n_nodes: [NUMBER_OF_NODES]
            matrix:
            - n_nodes
            - workload

Replace the following:

NCCL_BENCHMARK: a comma-separated list of NCCL benchmarks to run on the nodes, for example all-gather, all-reduce, reduce-scatter.
NUMBER_OF_NODES: a comma-separated list of the different numbers of nodes that the test runs against. Specify values between 1 and the number of nodes in your cluster partition, for example 2, 4, 8, 16, 32.
To run the NCCL test script, run the following command:

nohup bash ./run-nccl-tests-via-ramble.sh "$HOME" >& nccl-$(date -Iseconds).log & tail -f nccl-*.log

Running the NCCL test can take some time to complete. When it does complete, the output is similar to the following:

...
---- SUMMARY for >1GB Message Sizes ----
workload         n_nodes  msg_size     busbw
all-gather       2        1073741824   XXX.XX
all-gather       2        2147483648   XXX.XX
all-gather       2        4294967296   XXX.XX
all-gather       2        8589934592   XXX.XX
...
all-reduce       2        1073741824   XXX.XX
...
reduce-scatter   2        1073741824   XXX.XX
...
-------- Benchmarking Complete -------
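Because the script submits a separate Slurm job for each benchmark and node-count combination, you can also track progress with standard Slurm commands while the test runs:

# List your queued and running jobs. On older Slurm versions, use "squeue -u $USER".
squeue --me

# Summarize jobs that have already finished, including how long each one took.
sacct -X --format=JobID,JobName%30,State,Elapsed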
A3 Ultra
If you haven't already, then connect to a login node in your cluster.
In the $HOME directory, download the following NCCL scripts:

wget https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/tags/v1.65.0/examples/machine-learning/a3-ultragpu-8g/nccl-tests/import_pytorch_container.sh
wget https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/tags/v1.65.0/examples/machine-learning/a3-ultragpu-8g/nccl-tests/build-nccl-tests.sh
wget https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/tags/v1.65.0/examples/machine-learning/a3-ultragpu-8g/nccl-tests/run-nccl-tests.sh

To view a list of the network interface names that the compute nodes in a cluster partition use, run the following command:

srun -N 1 --partition=PARTITION_NAME ip addr show

Replace PARTITION_NAME with the name of the cluster partition that you want to test. To review the partitions in your cluster, view the details of your cluster.

To help ensure that the network interface names and the name of the partition to test are correct, complete the following steps:
Open the build-nccl-tests.sh and run-nccl-tests.sh files with a text editor of your choice.

Compare the network interface names in the run-nccl-tests.sh file with the network interface names that you viewed in the output of the previous command. If they don't match, then, in the run-nccl-tests.sh file, edit the NCCL_SOCKET_IFNAME variable as follows:

source /usr/local/gib/scripts/set_nccl_env.sh
export NCCL_NET=gIB
export NCCL_SOCKET_IFNAME=NCCL_SOCKET_IFNAMES

Replace NCCL_SOCKET_IFNAMES with a comma-separated list of NCCL socket interface names, for example enp0s1,enp192s1.

If the node that you want to test isn't in the cluster's default partition, then, in the sbatch section of the build-nccl-tests.sh and run-nccl-tests.sh files, add the following line:

#SBATCH --partition=PARTITION_NAME

Optional: To run a different NCCL benchmark than the default AllGather benchmark, edit the last line of the run-nccl-tests.sh file as follows:

/nccl/nccl-tests/build/NCCL_BENCHMARK -b 8M -e 8G -f 2 -g 1 -w 5 --iters 200;

Replace NCCL_BENCHMARK with the type of NCCL benchmark that you want to run on the cluster nodes, for example all_reduce_perf.
To import the squash container image, run the following command:
bash ./import_pytorch_container.sh

Importing the squash container image takes approximately 10 minutes to complete.
To build the test and grant it access to all of a node's resources by using the --exclusive flag, run the following command:

sbatch --partition=PARTITION_NAME --exclusive build-nccl-tests.sh

Building the test takes approximately five minutes to complete.
To run the NCCL test, run the following command:
sbatch -N TEST_SIZE --partition=PARTITION_NAME --exclusive run-nccl-tests.sh

Replace TEST_SIZE with the number of nodes that you want to test in your nodeset. Specify a value between 1 and the number of nodes in your cluster partition.

Running the NCCL test takes approximately five minutes to complete. When the test completes, the system creates a slurm-JOB_ID.out file in the $HOME directory with the results of your test.
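To confirm that the test job finished and to review the results, you can combine Slurm accounting with the output file. JOB_ID below is the same placeholder as in the file name from the previous step:

# Check the final state of the test job. Replace JOB_ID with the job's ID.
sacct -j JOB_ID --format=JobID,JobName%25,State,Elapsed

# Review the end of the results file, which contains the benchmark table.
tail -n 40 "$HOME"/slurm-JOB_ID.out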
A3 Mega
If you haven't already, then connect to a login node in your cluster.
In the $HOME directory, download the following NCCL scripts:

wget https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/tags/v1.65.0/examples/machine-learning/a3-megagpu-8g/nccl-tests/import_pytorch_container.sh
wget https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/tags/v1.65.0/examples/machine-learning/a3-megagpu-8g/nccl-tests/build-nccl-tests.sh
wget https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/tags/v1.65.0/examples/machine-learning/a3-megagpu-8g/nccl-tests/run-nccl-tests.sh

To view a list of the network interface names that the compute nodes in a cluster partition use, run the following command:

srun -N 1 --partition=PARTITION_NAME ip addr show

Replace PARTITION_NAME with the name of the cluster partition that you want to test. To review the partitions in your cluster, view the details of your cluster.

To help ensure that the network interface names and the name of the partition to test are correct, complete the following steps:
Open the build-nccl-tests.sh and run-nccl-tests.sh files with a text editor of your choice.

Compare the network interface names in the run-nccl-tests.sh file with the network interface names that you viewed in the output of the previous command. If they don't match, then, in the run-nccl-tests.sh file, edit the NCCL_FASTRAK_CTRL_DEV, NCCL_FASTRAK_IFNAME, and NCCL_SOCKET_IFNAME variables as follows:

NCCL_LIB_DIR="/var/lib/tcpxo/lib64" source /var/lib/tcpxo/lib64/nccl-env-profile.sh
export NCCL_FASTRAK_CTRL_DEV=NCCL_FASTRAK_CTRL_DEV
export NCCL_FASTRAK_IFNAME=NCCL_FASTRAK_IFNAME
export NCCL_SOCKET_IFNAME=NCCL_SOCKET_IFNAMES
export NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY=/dev/aperture_devices

Replace the following:

NCCL_FASTRAK_CTRL_DEV: the NCCL fastrak control device interface name, for example enp0s12.
NCCL_FASTRAK_IFNAME: a comma-separated list of NCCL fastrak interface names, for example enp6s0,enp7s0,enp13s0.
NCCL_SOCKET_IFNAMES: a comma-separated list of NCCL socket interface names, for example enp0s1.
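For example, with the sample interface names listed above, the edited lines in the run-nccl-tests.sh file would look similar to the following. These values are illustrative only; use the interface names from your own ip addr show output:

# Example values only; substitute the interfaces reported by your compute nodes.
export NCCL_FASTRAK_CTRL_DEV=enp0s12
export NCCL_FASTRAK_IFNAME=enp6s0,enp7s0,enp13s0
export NCCL_SOCKET_IFNAME=enp0s1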
If the node that you want to test isn't in the cluster's default partition, then, in the sbatch section of the build-nccl-tests.sh and run-nccl-tests.sh files, add the following line at the end of the section:

#SBATCH --partition=PARTITION_NAME

Optional: To run a different NCCL benchmark than the default AllGather benchmark, edit the last line of the run-nccl-tests.sh file as follows:

/nccl/nccl-tests/build/NCCL_BENCHMARK -b 8M -e 8G -f 2 -g 1 -w 5 --iters 200;

Replace NCCL_BENCHMARK with the type of NCCL benchmark that you want to run on the cluster nodes, for example all_reduce_perf.
To import the squash container image, run the following command:
bash ./import_pytorch_container.sh

Importing the squash container image takes approximately 10 minutes to complete.
To build the test and grant it access to all of a node's resources by using the --exclusive flag, run the following command:

sbatch --partition=PARTITION_NAME --exclusive build-nccl-tests.sh

Building the test takes approximately five minutes to complete.
To run the NCCL test, run the following command:
sbatch -N TEST_SIZE --partition=PARTITION_NAME --exclusive run-nccl-tests.sh

Replace TEST_SIZE with the number of nodes that you want to test in your nodeset. Specify a value between 1 and the number of nodes in your cluster partition.

Running the NCCL test takes approximately five minutes to complete. When the test completes, the system creates a slurm-JOB_ID.out file in the $HOME directory with the results of your test.
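After the job completes, you can scan the output file for the results from the login node. This is an optional check; the exact summary wording can vary by nccl-tests version:

# Find the most recent test output file in your home directory.
ls -t "$HOME"/slurm-*.out | head -n 1

# Look for the bus bandwidth results in the output.
grep -i "bus bandwidth" "$HOME"/slurm-*.out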
Bypass GPU health checks for a specific job
You can configure Slurm to submit a job on a compute node but skip the node's GPU health check. This approach is useful in the following scenarios:
Suspected false positives: after troubleshooting the node, you believe that the health check failure was a false positive.
Fault-tolerant workloads: your workload is designed to be resilient to the specific hardware issue that a health check detected.
To submit a job on a compute node while skipping the GPU health check, complete the following steps:
If you haven't already, then connect to a login node in your cluster.
If a GPU health check failed, then manually undrain the node if you haven't already:
sudo scontrol update nodename=NODE_NAME state=RESUME

Replace NODE_NAME with the name of the node.
Submit a job for the specific node but skip the GPU health check:
sbatch --export=ALL,SLURM_JOB_EXTRA="healthchecks_prolog=off" JOB_SCRIPT

Replace JOB_SCRIPT with the path to the job script that you want to run.
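As a concrete sketch, the following shows both steps together with placeholder names: a node called a3u-nodeset-5 and a job script called train-job.sh. Substitute your own node name and script path:

# Return the drained node to service (placeholder node name).
sudo scontrol update nodename=a3u-nodeset-5 state=RESUME

# Submit the job with the GPU health check prolog disabled (placeholder script).
sbatch --export=ALL,SLURM_JOB_EXTRA="healthchecks_prolog=off" train-job.sh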