Health checks

This page outlines the tests you can run on your instant cluster to verify that it is operating correctly. The tests target the different layers of a typical ML workload. The scripts used in this section can be found in our repository: https://github.com/datacrunch-research/supercomputing-clusters

NCCL test

Three options are available:

  • nccl-tests already built with Spack.

  • nccl-tests built against the NVIDIA HPC-X suite.

  • NCCL benchmarking through PyTorch (torch.distributed).

NCCL test from the system

From /home/ubuntu/, run:

sbatch all_reduce_example_slurm.job
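For reference, here is a minimal sketch of what such an all-reduce Slurm job can look like; the node/GPU counts and binary location are assumptions on our side, and the actual all_reduce_example_slurm.job on the cluster may differ:

#!/bin/bash
#SBATCH --job-name=nccl-all-reduce    # hypothetical job name
#SBATCH --nodes=2                     # adjust to the number of worker nodes
#SBATCH --ntasks-per-node=8           # one task per GPU
#SBATCH --gpus-per-node=8

# Assumes the Spack-built all_reduce_perf binary is already on PATH
srun all_reduce_perf -b 512M -e 8G -f 2 -g 1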

NCCL test from the HPC-X module

First, we load the HPC-X module:

module load hpcx

Then we compile nccl-tests against the HPC-X MPI libraries:

git clone https://github.com/NVIDIA/nccl-tests.git && \
cd /home/ubuntu/nccl-tests && \
make MPI=1 -j$(nproc)
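Before moving to multiple nodes, the freshly built binary can be sanity-checked on a single node. The flags below are our own example and are not part of the repository scripts:

# All-reduce across the 8 local GPUs of one node, message sizes from 8 B to 128 MB
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8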

We create hostfile.txt with the list of worker nodes that will take part in the test:

cat /etc/mpihosts > hostfile.txt
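The resulting hostfile.txt should contain one worker node hostname per line; the names below are purely hypothetical and only illustrate the expected format:

cluster-node-001
cluster-node-002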

Once compiled, we can run the test with the following command:

mpirun -np 16 -N 8 -x NCCL_NET_PLUGIN=/opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so -hostfile hostfile.txt ./build/all_reduce_perf -b 512M -e 8G -f 2 -g 1

This script can be found in our repository.
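In the mpirun command above, -np is the total number of ranks and -N the number of ranks per node, so -np 16 -N 8 corresponds to two 8-GPU nodes. To run the same test on, for example, four nodes, only -np changes; the line below is our own variation rather than a script from the repository:

# Hypothetical four-node run: 4 nodes x 8 GPUs = 32 ranks
mpirun -np 32 -N 8 -x NCCL_NET_PLUGIN=/opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so -hostfile hostfile.txt ./build/all_reduce_perf -b 512M -e 8G -f 2 -g 1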

NCCL test from PyTorch

Single-node

Assuming https://github.com/datacrunch-research/supercomputing-clusters has been cloned to /home/ubuntu/:

# run from supercomputing-clusters/nccl_test
torchrun --standalone --nproc_per_node=8 nccl_torch.py --min-size 512MB --max-size 8GB --num-iters 5 --pin-memory --preallocate
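If the run misbehaves, NCCL's own logging can be enabled to check which transports and network plugins are picked up. NCCL_DEBUG is a generic NCCL environment variable, not something specific to the repository script:

# Same single-node run with verbose NCCL logging (the output can be large)
NCCL_DEBUG=INFO torchrun --standalone --nproc_per_node=8 nccl_torch.py --min-size 512MB --max-size 8GB --num-iters 5 --pin-memory --preallocate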

Multi-node

# run from supercomputing-clusters/multinode_torch_distributed
sbatch slurm_multinode_torch_distributed.sh
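After submitting, the job can be monitored with standard Slurm commands; <jobid> is a placeholder for the ID printed by sbatch:

squeue -u $USER            # check that the job is queued or running
tail -f slurm-<jobid>.out  # follow the output (default Slurm output file name)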
