Enroot + Pyxis
Here we present a basic test of containerized environments using Enroot and Pyxis, both from NVIDIA. We then walk through a multi-node training workload based on a containerized version of TorchTitan.
First, we test Enroot:
enroot import docker://ubuntu
enroot create -n ubuntu ubuntu.sqsh
enroot start ubuntu sh -c 'grep PRETTY /etc/os-release'
> PRETTY_NAME="Ubuntu 24.04.2 LTS"
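Optionally, we can list the containers Enroot has created and remove the test one afterwards; both subcommands are part of standard Enroot:
enroot list
enroot remove ubuntu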
Second, we verify that Pyxis gives the same result:
srun --container-image=ubuntu grep PRETTY /etc/os-release
> PRETTY_NAME="Ubuntu 24.04.2 LTS"
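Since the workload below is multi-node, it is also worth repeating the Pyxis check across several nodes at once; a minimal sketch, assuming two worker nodes are available:
srun -N 2 --ntasks-per-node=1 --container-image=ubuntu grep PRETTY /etc/os-release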
We assume the repository https://github.com/datacrunch-research/supercomputing-clusters has been cloned into /home/ubuntu/.
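If it is not already present, it can be cloned as follows:
cd /home/ubuntu
git clone https://github.com/datacrunch-research/supercomputing-clusters.git
cd supercomputing-clusters/enroot_pyxis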
We build the image from the Dockerfile in supercomputing-clusters/enroot_pyxis.
Take care to include the HF_TOKEN in .bashrc or to export it in the current bash session.
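For example, with hf_xxx as a placeholder for your actual Hugging Face token:
export HF_TOKEN=hf_xxx                      # current session only
echo 'export HF_TOKEN=hf_xxx' >> ~/.bashrc  # persist across sessions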
docker build -f torchtitan.dockerfile --build-arg HF_TOKEN="$HF_TOKEN" -t torchtitan_cuda128_torch27 .
Then we import the Docker image into the squash file that Enroot will use:
enroot import -o /home/ubuntu/torchtitan_cuda128_torch27.sqsh dockerd://torchtitan_cuda128_torch27
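As a quick sanity check, Pyxis can start the squash file directly from its path; a minimal sketch, assuming PyTorch is installed in the image as its tag suggests:
srun --container-image=/home/ubuntu/torchtitan_cuda128_torch27.sqsh \
     python -c 'import torch; print(torch.__version__)'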
To leverage local storage on the worker nodes, we create the following folder on each of them and store the model inside: /mnt/local_disk/huggingface
sudo mkdir /mnt/local_disk/huggingface
sudo chown -R ubuntu:ubuntu /mnt/local_disk/huggingface/
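With the cache directory in place, the model can be pre-downloaded into it by pointing HF_HOME at that path; a minimal sketch using huggingface-cli, where <model_id> is a placeholder for whichever checkpoint your TorchTitan config expects:
HF_HOME=/mnt/local_disk/huggingface huggingface-cli download <model_id>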
Now we submit the multi-node sbatch script located at supercomputing-clusters/enroot_pyxis/multinode_torchtitan.sh:
sbatch multinode_torchtitan.sh
As a result, we should expect logs in /home/ubuntu/slurm_logging/headnode/torchtitan_multinode_X.err and .out, where X is the Slurm job ID.
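While the job runs, progress can be followed with standard Slurm tooling, for example:
squeue -u $USER
tail -f /home/ubuntu/slurm_logging/headnode/torchtitan_multinode_X.out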