Enroot + Pyxis
Here we present a basic test of containerized environments using Enroot and Pyxis, both from NVIDIA. We then walk through a multi-node training workload based on a containerized version of TorchTitan.
First, we test Enroot:
enroot import docker://ubuntu
enroot create -n ubuntu ubuntu.sqsh
enroot start ubuntu sh -c 'grep PRETTY /etc/os-release'
> PRETTY_NAME="Ubuntu 24.04.2 LTS"
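Optionally, we can list the containers Enroot has created and remove the test one afterwards; both subcommands are part of standard Enroot:
enroot list
enroot remove ubuntu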
Second, we verify that Pyxis gives the same result:
srun --container-image=ubuntu grep PRETTY /etc/os-release
> PRETTY_NAME="Ubuntu 24.04.2 LTS"
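Since the workload below is multi-node, it is also worth repeating the Pyxis check across several nodes at once; a minimal sketch, assuming two worker nodes are available:
srun -N 2 --ntasks-per-node=1 --container-image=ubuntu grep PRETTY /etc/os-release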
We assume the repository https://github.com/datacrunch-research/supercomputing-clusters has been cloned into /home/ubuntu/.
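If it is not already present, it can be cloned as follows:
cd /home/ubuntu
git clone https://github.com/datacrunch-research/supercomputing-clusters.git
cd supercomputing-clusters/enroot_pyxis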
We build the image from the Dockerfile in supercomputing-clusters/enroot_pyxis.
Take care to include the HF_TOKEN in .bashrc or to export it in the current bash session.
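For example, with hf_xxx as a placeholder for your actual Hugging Face token:
export HF_TOKEN=hf_xxx                      # current session only
echo 'export HF_TOKEN=hf_xxx' >> ~/.bashrc  # persist across sessions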
docker build -f torchtitan.dockerfile --build-arg HF_TOKEN="$HF_TOKEN" -t torchtitan_cuda128_torch27 .
Then we import the Docker image into the squash file that Enroot will use:
enroot import -o /home/ubuntu/torchtitan_cuda128_torch27.sqsh dockerd://torchtitan_cuda128_torch27
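As a quick sanity check, Pyxis can start the squash file directly from its path; a minimal sketch, assuming PyTorch is installed in the image as its tag suggests:
srun --container-image=/home/ubuntu/torchtitan_cuda128_torch27.sqsh \
     python -c 'import torch; print(torch.__version__)'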
To leverage local storage on the worker nodes, we create the following folder on each of them and store the model inside: /mnt/local_disk/huggingface
sudo mkdir /mnt/local_disk/huggingface
sudo chown -R ubuntu:ubuntu /mnt/local_disk/huggingface/
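With the cache directory in place, the model can be pre-downloaded into it by pointing HF_HOME at that path; a minimal sketch using huggingface-cli, where <model_id> is a placeholder for whichever checkpoint your TorchTitan config expects:
HF_HOME=/mnt/local_disk/huggingface huggingface-cli download <model_id>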
Now we submit the multi-node sbatch script located at supercomputing-clusters/enroot_pyxis/multinode_torchtitan.sh:
sbatch multinode_torchtitan.sh
As a result, we should expect logs in /home/ubuntu/slurm_logging/headnode/torchtitan_multinode_X.err and .out, where X is the Slurm job ID.
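While the job runs, progress can be followed with standard Slurm tooling, for example:
squeue -u $USER
tail -f /home/ubuntu/slurm_logging/headnode/torchtitan_multinode_X.out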