Containers

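This page presents a basic test of the containerized environment using Enroot and Pyxis, both from NVIDIA. Enroot is the container runtime, and Pyxis is the Slurm plugin that exposes it through srun flags such as --container-image.

First, we test Enroot by importing the Ubuntu image from Docker Hub, creating a container from it, and printing its release name: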

enroot import docker://ubuntu
enroot create -n ubuntu ubuntu.sqsh
enroot start ubuntu sh -c 'grep PRETTY /etc/os-release'
> PRETTY_NAME="Ubuntu 24.04.2 LTS"

Next, we check that Pyxis returns the same result:

srun --container-image=ubuntu grep PRETTY /etc/os-release
> PRETTY_NAME="Ubuntu 24.04.2 LTS"
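
The comparison between the two checks can also be automated by capturing both outputs. The following is a minimal sketch rather than part of the original test; the helper names pretty_name and same_release are hypothetical, and the cluster commands are left as comments so the helpers can be tried on any machine.

```shell
# Hypothetical helper: print the value of PRETTY_NAME from os-release style text on stdin.
pretty_name() {
    sed -n 's/^PRETTY_NAME="\(.*\)"$/\1/p'
}

# Hypothetical helper: succeed only if both captured outputs name the same release.
same_release() {
    name_a=$(printf '%s\n' "$1" | pretty_name)
    name_b=$(printf '%s\n' "$2" | pretty_name)
    [ -n "$name_a" ] && [ "$name_a" = "$name_b" ]
}

# On the cluster, the outputs would come from the two commands shown earlier:
# enroot_out=$(enroot start ubuntu sh -c 'grep PRETTY /etc/os-release')
# pyxis_out=$(srun --container-image=ubuntu grep PRETTY /etc/os-release)
# same_release "$enroot_out" "$pyxis_out" && echo "Enroot and Pyxis agree"
```

An empty result on either side counts as a mismatch, so a failed container start is not mistaken for agreement.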

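To use a custom image built with the local Docker daemon (dockerd) instead:

  1. Build the image from a custom Dockerfile:

docker build -f <file.dockerfile> -t <name:tag> .

  2. Import the image from dockerd into Enroot (for an image hosted in a registry, use docker://IMAGE:TAG instead):

enroot import dockerd://<name:tag>

  3. Point the --container-image flag at the resulting .sqsh file. By default Enroot names the file after the image, with / and : replaced by +, so <name:tag> becomes <name>+<tag>.sqsh; pass -o to enroot import to choose the path yourself:

--container-image=<name>+<tag>.sqsh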
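
The three steps can be chained in a small wrapper. This is a sketch under stated assumptions, not a tool shipped with Enroot or Pyxis: the function names run and build_import_test, the DRY_RUN variable, and the example file names are all hypothetical. It passes -o to enroot import so the .sqsh path is explicit, and it reuses the os-release check from the Ubuntu test, which assumes the custom image ships /etc/os-release.

```shell
# Run a command, or only print it when DRY_RUN=1 (hypothetical convention for this sketch).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper: build an image, import it to a chosen .sqsh path, test it with Pyxis.
# $1: Dockerfile, $2: image name:tag, $3: output .sqsh path
build_import_test() {
    run docker build -f "$1" -t "$2" . &&
    run enroot import -o "$3" "dockerd://$2" &&
    run srun --container-image="$3" grep PRETTY /etc/os-release
}

# Preview the commands without touching Docker, Enroot, or Slurm.
DRY_RUN=1 build_import_test my.dockerfile myimage:v1 "$PWD/myimage_v1.sqsh"
```

Without DRY_RUN=1 the same call executes the commands, and the && chaining stops at the first step that fails.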

Example: torchtitan multi-node

We clone cluster-tests into /home/ubuntu:

git clone https://github.com/datacrunch-research/cluster-tests.git /home/ubuntu/cluster-tests

NOTE: The build needs HF_TOKEN, either set in .bashrc or exported in the current shell session, and the token must belong to a Hugging Face account that has been granted access to the Llama 3 family of models.

We build the image from torchtitan.dockerfile:

docker build -f torchtitan.dockerfile --build-arg HF_TOKEN="$HF_TOKEN" -t torchtitan_cuda128_torch27 .
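
Because an unset token is only passed to the build as an empty string, it can help to check for it first. A minimal sketch, assuming a hypothetical helper named require_hf_token; the docker build line is commented out so the helper can be tried without Docker.

```shell
# Hypothetical guard: fail early with a clear message if HF_TOKEN is unset or empty.
require_hf_token() {
    if [ -z "${HF_TOKEN:-}" ]; then
        echo "HF_TOKEN is not set; export it before running docker build" >&2
        return 1
    fi
}

# Usage (commented out so the sketch runs without Docker):
# require_hf_token && docker build -f torchtitan.dockerfile --build-arg HF_TOKEN="$HF_TOKEN" -t torchtitan_cuda128_torch27 .
```

The check only confirms that the variable is non-empty; it does not verify that the token has access to the models.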

Then we import the image from dockerd into a squashfs file, which Enroot will use:

enroot import -o /home/ubuntu/torchtitan_cuda128_torch27.sqsh dockerd://torchtitan_cuda128_torch27

Now, we execute torchtitan_multinode.sh:

sbatch torchtitan_multinode.sh
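
Before submitting, it can be worth confirming that the import step produced the image. The sketch below assumes the batch script reads the image from the path used in the import command, which the source does not state; require_image is a hypothetical helper, and the sbatch line is commented out so the check can be tried without Slurm.

```shell
# Hypothetical guard: confirm the squashfs image exists before submitting the job.
# $1: path to the .sqsh file
require_image() {
    if [ ! -f "$1" ]; then
        echo "image not found: $1" >&2
        return 1
    fi
}

# Usage on the cluster (commented out so the sketch runs without Slurm):
# require_image /home/ubuntu/torchtitan_cuda128_torch27.sqsh && sbatch torchtitan_multinode.sh
```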
