Slurm
DataCrunch On-demand clusters have the Slurm job-scheduling system preinstalled. You can verify that Slurm is working by running a simple job that uses 16 GPUs. This will execute on at least two physical nodes, since each server has 8 GPUs:
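As a minimal sketch (the exact flags are an assumption, not the cluster's canonical example), a job like the following allocates 8 GPUs on each of two nodes and lists the GPUs it sees:

```bash
# Allocate 8 GPUs on each of 2 nodes (16 GPUs total) and print the hostname plus visible GPUs
srun --nodes=2 --ntasks-per-node=1 --gres=gpu:8 bash -c 'hostname; nvidia-smi -L'
```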
The following example utilizes the Spack package system. You can also check out our documentation and the official Spack documentation to learn more.
Spack is pre-installed into /home/spack.
To make spack commands available in your terminal, run:
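A minimal sketch, assuming Spack's standard shell-integration script lives under the pre-installed /home/spack tree (the exact path may differ on your cluster):

```bash
# Make the `spack` command available in the current shell
. /home/spack/share/spack/setup-env.sh
```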
You can see how we build Spack by looking inside this script: /usr/local/bin/spack.setup.sh
Below is an example of a simple Slurm job that runs the all_reduce_perf benchmark from nccl-tests with 16 GPUs on two nodes. The example job description is created by the Spack installation script and is placed at /home/ubuntu/all_reduce_example_slurm.job:
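The exact contents may differ on your cluster; the following sketch reflects the steps described below, with the #SBATCH values and all_reduce_perf arguments chosen for illustration:

```bash
#!/bin/bash
#SBATCH --job-name=all_reduce_example   # name shown in squeue
#SBATCH --nodes=2                       # two physical nodes
#SBATCH --ntasks-per-node=8             # one task per GPU
#SBATCH --gres=gpu:8                    # 8 GPUs per node, 16 in total

# Initialize Spack so the `spack` command is available (path is an assumption)
. /home/spack/share/spack/setup-env.sh

# Put all_reduce_perf from nccl-tests on the $PATH
spack load nccl-tests

# Run the NCCL all-reduce benchmark on all 16 GPUs (message sizes chosen for illustration)
srun all_reduce_perf -b 512M -e 8G -f 2 -g 1
```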
You can execute the above job by running:
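Assuming the job file shown above, submission is done with sbatch:

```bash
sbatch /home/ubuntu/all_reduce_example_slurm.job
```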
Briefly, the above script does the following:
Lines starting with #SBATCH define the requirements for the job.
Next, Spack is initialized, making the spack command available.
spack load nccl-tests makes all_reduce_perf available in the $PATH.
srun executes the actual command.
To read what the job printed to stdout, look in the file slurm-<jobid>.out located in the directory from which you submitted the job. You can run squeue -a to list all Slurm jobs and their status.
Get the general state of worker nodes with the command sinfo.
In the unlikely case that Slurm does not see the worker nodes after a reboot, you can make them available again with:
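As a sketch using standard Slurm tooling (the node name is a placeholder; take real names from sinfo):

```bash
# Return a down/drained node to service (replace node001 with the actual node name from sinfo)
sudo scontrol update NodeName=node001 State=RESUME
```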
If a node becomes unhealthy (for example, due to a full / partition), it will enter the drained state. Check the reason with:
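Standard Slurm tooling exposes the drain reason; for example:

```bash
# List down/drained nodes along with the reason Slurm recorded
sinfo -R
```

For more detail on a specific node, scontrol show node <nodename> also prints the Reason field.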