Slurm
Overview
DataCrunch On-demand clusters come with the Slurm job-scheduling system preinstalled. You can verify that Slurm is working by running a simple job that uses 16 GPUs. This will execute on at least two physical nodes, since each server has 8 GPUs:
srun --gpus=16 nvidia-smi
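To additionally confirm that the job spans more than one node, you can print the hostname from each allocated node. With Slurm's usual default of one task per node, the output should show one hostname per allocated server:
srun --gpus=16 hostname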
Example Slurm job
Pre-requisites
The following example uses the Spack package manager. You can also check out our simple Spack tutorial and the official Spack documentation to learn more.
Spack is pre-installed into /home/spack.
To make the spack commands available in your terminal, run:
. /home/spack/spack/share/spack/setup-env.sh
You can see how we build Spack by looking inside this script: /usr/local/bin/spack.setup.sh
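After sourcing the setup script, you can quickly verify that Spack works, for example by printing its version and listing the packages that are already installed:
spack --version
spack find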
Simple Slurm job
Below is an example of a simple Slurm job that runs the all_reduce_perf benchmark from nccl-tests with 16 GPUs on two nodes. The example job description is created by the Spack installation script and is placed here: /home/ubuntu/all_reduce_example_slurm.job:
#!/bin/bash
#SBATCH --job-name=all_reduce_perf
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-task=22
#SBATCH --time=00:05:00
. /home/spack/spack/share/spack/setup-env.sh
spack load nccl-tests
srun --mpi=pmix all_reduce_perf -b 1M -e 1G -f 2 -g 1
You can execute the above job by running:
$ sbatch /home/ubuntu/all_reduce_example_slurm.job
Briefly, the above script does the following:
Lines starting with #SBATCH define the requirements for the job.
Next, Spack is initialized, making the spack command available.
spack load nccl-tests makes all_reduce_perf available in the $PATH.
srun executes the actual command via the pmix MPI plugin (see the note below).
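If you are unsure which MPI plugin types your srun installation supports (the example uses pmix), you can list them with:
srun --mpi=list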
To read what the job printed to stdout, look in the file slurm-<jobid>.out located in the directory you submitted the job from. You can run squeue -a to list all Slurm jobs and their status.
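For example, with <jobid> standing in for the ID that sbatch prints, you can check the queue and follow the job output as it is written:
squeue -a
tail -f slurm-<jobid>.out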
Troubleshooting
Get the general state of the worker nodes with the sinfo command.
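For example:
sinfo
sinfo -N -l
The -N flag prints one line per node and -l adds more detail, such as the node state.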
In the unlikely case that Slurm does not see the worker nodes after a reboot, you can make them available again with:
scontrol update NodeName=cluster-name-1 State=RESUME
If a node becomes unhealthy (for example, a full / partition), it will enter the drained state. Check the reason with:
scontrol show node cluster-name-1
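To list every node that currently has a reason set (for example drained or down nodes), you can also run:
sinfo -R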
To troubleshoot distributed Slurm workloads, it is crucial to capture logs from both the head node and the worker nodes.
Logging
Use the following #SBATCH directives at the top of your Slurm script to capture logs produced by commands executed in the main script body (typically run on the head node):
#SBATCH --output=/path/to/logs/headnode/jobname_%j.out
#SBATCH --error=/path/to/logs/headnode/jobname_%j.err
--output captures the STDOUT from the head node script body
--error captures the STDERR from the head node script body
%j is replaced with the Slurm job ID, keeping logs organized
To capture logs from each worker node, use the --output and --error flags inside the srun command:
srun \
--output=/path/to/logs/workernodes/jobname_%j_node%N.out \
--error=/path/to/logs/workernodes/jobname_%j_node%N.err \
your_distributed_command
%j - Slurm job ID
%N - node name (e.g., node001), ensuring per-node log separation
These logs will capture:
Logs from each task/rank running under srun
Application-specific output (e.g. print statements, warnings, stack traces)
Environment or setup errors during distributed execution
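Putting this together, below is a minimal sketch of a job script that captures both head-node and per-node logs. The /home/ubuntu/logs paths are only illustrative; create the directories with mkdir -p before submitting, since Slurm typically will not create missing output directories for you.
#!/bin/bash
#SBATCH --job-name=all_reduce_perf
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-task=22
#SBATCH --time=00:05:00
#SBATCH --output=/home/ubuntu/logs/headnode/all_reduce_%j.out
#SBATCH --error=/home/ubuntu/logs/headnode/all_reduce_%j.err

# Head-node script body: output from these commands goes to the files above
. /home/spack/spack/share/spack/setup-env.sh
spack load nccl-tests

# Per-node logs: each worker node writes its own .out/.err file
srun --mpi=pmix \
    --output=/home/ubuntu/logs/workernodes/all_reduce_%j_node%N.out \
    --error=/home/ubuntu/logs/workernodes/all_reduce_%j_node%N.err \
    all_reduce_perf -b 1M -e 1G -f 2 -g 1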