Good to know

Cluster node naming convention

Cluster node names are derived from the Hostname you specify when creating the cluster:

  • Jump host: hostname-jumphost

  • Worker nodes: hostname-1, hostname-2, etc.
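As an illustration of the convention above (mycluster is a hypothetical Hostname):

```shell
# Hypothetical example: with Hostname "mycluster" the nodes are named:
HOST=mycluster
echo "${HOST}-jumphost"     # jump host: mycluster-jumphost
for i in 1 2; do
  echo "${HOST}-${i}"       # workers:   mycluster-1, mycluster-2, ...
done
```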

Storage

There is a shared network filesystem mounted at /home on every node of the cluster.

Each worker node also has a local NVMe drive mounted at /mnt/local_disk for fast local I/O.

Storage with Docker

For Docker workloads it is recommended to move the Docker data root off the system disk (/), which Docker uses by default. To do this:

  1. Create a directory at /mnt/local_disk/docker to keep Docker files on the local NVMe drive, or at /home/ubuntu/docker to use the shared NFS folder.

  2. Modify the Docker daemon settings in /etc/docker/daemon.json as follows, adding the data-root key to point at the new directory (remember to edit this file via sudo):

{
    "data-root": "/mnt/local_disk/docker",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
  3. Restart the Docker service:

sudo systemctl restart docker
  4. Perform a sanity check by running:

docker info

Expected output:

> [...]
> Docker Root Dir: /mnt/local_disk/docker
> [...]
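Note that /etc/docker/daemon.json must be strict JSON, which does not allow comments. A minimal sketch for checking the file parses before restarting Docker (the configuration is written to a temporary file here; copy it into place with sudo only after it validates):

```shell
# Write the proposed configuration to a temporary file first; install it at
# /etc/docker/daemon.json with sudo only after it validates.
cat > /tmp/daemon.json <<'EOF'
{
    "data-root": "/mnt/local_disk/docker",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
EOF

# python3 -m json.tool exits non-zero on malformed JSON
python3 -m json.tool /tmp/daemon.json > /dev/null && echo "daemon.json: valid JSON"
```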

InfiniBand partitioning

Worker nodes are interconnected over a partitioned 400 Gb/s InfiniBand fabric protected with an M_KEY. Because of the partitioning, commands such as ibhosts will not work, but distributed workloads such as MPI run correctly.

To use InfiniBand and NCCL from inside a Docker container, make sure to set the environment variable NCCL_IB_PKEY=1.

For example:

docker run -e NCCL_IB_PKEY=1 <image>
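For jobs launched directly on the host (e.g. via mpirun or torchrun) rather than through docker run -e, the same variable can be exported in the launching shell. A quick check that it is set:

```shell
# NCCL reads NCCL_IB_PKEY from the process environment at startup.
export NCCL_IB_PKEY=1
env | grep '^NCCL_IB_PKEY='
```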

Troubleshooting

Some versions of git have problems creating temporary files on shared filesystems. A workaround is to point git's object storage at local disk (create the directory first, then set the environment variable):

mkdir -p /tmp/git-objects
export GIT_OBJECT_DIRECTORY=/tmp/git-objects
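To confirm the workaround takes effect, you can make a throwaway commit and check that loose objects land under /tmp/git-objects rather than .git/objects. A sketch (the repository name and file below are arbitrary):

```shell
mkdir -p /tmp/git-objects
export GIT_OBJECT_DIRECTORY=/tmp/git-objects

# Throwaway repo in a temp dir; identity is set inline so commit works anywhere.
cd "$(mktemp -d)"
git init -q demo && cd demo
echo hello > file.txt
git add file.txt
git -c user.name=test -c user.email=test@example.com commit -qm "test commit"

# Loose objects should now appear under the redirected directory.
find /tmp/git-objects -type f | head
```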

Other

Worker nodes use the jump host as their default gateway, NAT firewall, and Slurm controller.

CUDA, OpenMPI, doca-ofed, and the NVIDIA drivers are installed on each server.

A PyTorch installer setup script is available at /home/pytorch.setup.sh.
