Containers
We present here a basic test for containerized environments using Enroot and Pyxis, both from NVIDIA.
First, we test Enroot:
enroot import docker://ubuntu
enroot create -n ubuntu ubuntu.sqsh
enroot start ubuntu sh -c 'grep PRETTY /etc/os-release'
> PRETTY_NAME="Ubuntu 24.04.2 LTS"
Secondly, we ensure we get the same results from testing Pyxis:
srun --container-image=ubuntu grep PRETTY /etc/os-release
> PRETTY_NAME="Ubuntu 24.04.2 LTS"
Alternatively, to use a custom image built in dockerd:
Build a custom image from a Dockerfile:
docker build -f <file.dockerfile> -t <name:tag> .
Import the dockerd image into Enroot (an image can also be imported from a registry with docker://IMAGE:TAG):
enroot import dockerd://<name:tag>
Point the --container-image flag at the resulting name:tag.sqsh file:
--container-image=<name:tag>.sqsh
Example: torchtitan multi-node
We clone cluster-tests into /home/ubuntu:
git clone https://github.com/datacrunch-research/cluster-tests.git /home/ubuntu/cluster-tests
We build the image based on torchtitan.dockerfile:
NOTE: HF_TOKEN must be set in .bashrc or exported in the current bash session, and the token must have been granted access to the Llama 3 family of models.
docker build -f torchtitan.dockerfile --build-arg HF_TOKEN="$HF_TOKEN" -t torchtitan_cuda128_torch27 .
Then we import the image as a squash file, which Enroot will use:
enroot import -o /home/ubuntu/torchtitan_cuda128_torch27.sqsh dockerd://torchtitan_cuda128_torch27
Now, we execute torchtitan_multinode.sh:
sbatch torchtitan_multinode.sh
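Before submitting the job, it can help to confirm the prerequisites from the steps above. The following is a minimal sketch, not part of cluster-tests: `preflight` is a hypothetical helper name, and the default path is simply the one used in the `enroot import -o` step. It checks HF_TOKEN because the image build depends on it; whether the job itself reads the token at run time is not stated here, so treat that check as an assumption.

```shell
# Hypothetical preflight helper (an illustration, not part of cluster-tests).
# Usage: preflight [path-to-sqsh]
# Returns 0 if all checks pass, 1 otherwise; problems are reported on stderr.
preflight() {
    # Default path matches the `enroot import -o` step above.
    local sqsh="${1:-/home/ubuntu/torchtitan_cuda128_torch27.sqsh}"
    local status=0

    # The docker build step passes HF_TOKEN as a build argument,
    # so an empty or unset token is worth catching early.
    if [ -z "${HF_TOKEN:-}" ]; then
        echo "preflight: HF_TOKEN is not set" >&2
        status=1
    fi

    # The squash file must exist and be non-empty.
    if [ ! -s "$sqsh" ]; then
        echo "preflight: squash file missing or empty: $sqsh" >&2
        status=1
    fi

    return "$status"
}
```

One way to use it is to gate the submission on the result, for example `preflight && sbatch torchtitan_multinode.sh`, so that the job is only queued when both checks pass.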