Containers
We present here a basic test for containerized environments using Enroot and Pyxis, both from NVIDIA.
First, we test Enroot directly:
enroot import docker://ubuntu
enroot create -n ubuntu ubuntu.sqsh
enroot start ubuntu sh -c 'grep PRETTY /etc/os-release'
> PRETTY_NAME="Ubuntu 24.04.2 LTS"
Second, we check that Pyxis returns the same result when the container is launched through Slurm:
srun --container-image=ubuntu grep PRETTY /etc/os-release
> PRETTY_NAME="Ubuntu 24.04.2 LTS"
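To compare the two checks mechanically rather than by eye, the PRETTY_NAME value can be extracted from each output. The following is a minimal sketch; pretty_name is a hypothetical helper introduced here for illustration and is not part of Enroot or Pyxis.

```shell
# pretty_name: hypothetical helper, not part of Enroot or Pyxis.
# Reads os-release-style text on stdin and prints the PRETTY_NAME value
# without the surrounding quotes.
pretty_name() {
    sed -n 's/^PRETTY_NAME="\(.*\)"$/\1/p'
}

# Self-check against the line shown in the output above.
printf 'PRETTY_NAME="Ubuntu 24.04.2 LTS"\n' | pretty_name
```

On a cluster, and assuming the ubuntu container created above still exists, the two results could then be compared with:
[ "$(enroot start ubuntu cat /etc/os-release | pretty_name)" = "$(srun --container-image=ubuntu cat /etc/os-release | pretty_name)" ] && echo match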
Alternatively, to use a custom image built with the local Docker daemon (dockerd):
Build a custom image from a Dockerfile with:
docker build -f <file.dockerfile> -t <name:tag> .
Import the image from dockerd into Enroot (an image from a registry can be imported with docker://IMAGE:TAG instead):
enroot import dockerd://<name:tag>
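Point the --container-image flag at the resulting .sqsh file. Unless -o is given, Enroot derives the filename from the image name, so check the name of the file it actually wrote: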
--container-image=<name:tag>.sqsh
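The steps above can be chained in a small script. This is a sketch under assumptions: run and build_and_import are hypothetical helpers, and the output path is passed explicitly with -o so the squashfs filename does not depend on Enroot's default naming. With DRY_RUN=1 the commands are printed instead of executed, which makes it easy to check the arguments first.

```shell
# run: hypothetical wrapper that prints the command when DRY_RUN=1
# and executes it otherwise.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# build_and_import DOCKERFILE IMAGE SQSH
# Builds IMAGE from DOCKERFILE with dockerd, imports it into the squashfs
# file SQSH, then prints the Pyxis flag that points at that file.
build_and_import() {
    dockerfile=$1
    image=$2
    sqsh=$3
    run docker build -f "$dockerfile" -t "$image" . || return 1
    run enroot import -o "$sqsh" "dockerd://$image" || return 1
    printf '%s\n' "--container-image=$sqsh"
}
```

For example, with placeholder names:
DRY_RUN=1 build_and_import my.dockerfile myimage:v1 /home/ubuntu/myimage_v1.sqsh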
Example: torchtitan multi-node
We clone cluster-tests into /home/ubuntu:
git clone https://github.com/datacrunch-research/cluster-tests.git /home/ubuntu/cluster-tests
We build the image based on torchtitan.dockerfile:
NOTE: HF_TOKEN must be set in .bashrc or exported in the current bash session, and the token must belong to an account that has been granted access to the Llama 3 family of models.
docker build -f torchtitan.dockerfile --build-arg HF_TOKEN="$HF_TOKEN" -t torchtitan_cuda128_torch27 .
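Because an empty HF_TOKEN is passed to the build without any complaint from the shell, it can help to fail early. A minimal sketch follows; require_hf_token is a hypothetical guard, not something provided by cluster-tests. It only checks that the variable is non-empty, not that the token has access to the models.

```shell
# require_hf_token: hypothetical guard. Returns non-zero and prints a
# message on stderr when HF_TOKEN is unset or empty.
require_hf_token() {
    if [ -z "${HF_TOKEN:-}" ]; then
        echo "HF_TOKEN is not set; export it before running docker build" >&2
        return 1
    fi
}
```

It could be combined with the build command like this:
require_hf_token && docker build -f torchtitan.dockerfile --build-arg HF_TOKEN="$HF_TOKEN" -t torchtitan_cuda128_torch27 .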
Then we import the image into a squashfs file, which Enroot will use:
enroot import -o /home/ubuntu/torchtitan_cuda128_torch27.sqsh dockerd://torchtitan_cuda128_torch27
Now, we execute torchtitan_multinode.sh:
sbatch torchtitan_multinode.sh
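sbatch reports the new job with a line of the form "Submitted batch job <id>". To follow the job afterwards, the id can be captured at submission time. This is a small sketch; job_id_from_sbatch is a hypothetical helper, and the contents of torchtitan_multinode.sh are not assumed here.

```shell
# job_id_from_sbatch: hypothetical helper. Reads sbatch's
# "Submitted batch job <id>" line on stdin and prints only the id.
job_id_from_sbatch() {
    sed -n 's/^Submitted batch job \([0-9][0-9]*\).*$/\1/p'
}
```

For example:
jobid=$(sbatch torchtitan_multinode.sh | job_id_from_sbatch)
squeue -j "$jobid"
Where the job output goes depends on the --output setting in the script. If none is set, Slurm writes to slurm-<jobid>.out in the submission directory.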