Good to know

Cluster node naming convention

Cluster node names will be based on the Hostname you specify when creating the cluster:

  • Jump host: hostname-jumphost

  • Worker nodes: hostname-1, hostname-2, etc.
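
The convention above can be sketched as a small shell function that prints the expected node names. This is only an illustration: the function name cluster_node_names, the hostname "mycluster", and the worker count are hypothetical values, not part of the DataCrunch tooling.

```shell
# Print the node names implied by the naming convention:
# one jump host, then numbered worker nodes.
# Usage: cluster_node_names <hostname> <worker_count>
cluster_node_names() (
  hostname="$1"
  workers="$2"
  echo "${hostname}-jumphost"
  i=1
  while [ "$i" -le "$workers" ]; do
    echo "${hostname}-${i}"
    i=$((i + 1))
  done
)

# "mycluster" and 2 are illustrative values.
cluster_node_names mycluster 2
```

With these example values the function prints mycluster-jumphost, mycluster-1, and mycluster-2, one per line.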

Storage

There is a shared network filesystem mounted at /home on every node in the cluster.

Each worker node also has a local NVMe drive mounted at /mnt/local_disk for fast, node-local I/O.
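
One possible way to use this layout is to keep files that every node needs under /home and write temporary data to the local drive. The sketch below picks a scratch directory accordingly. The function name pick_scratch_dir and the fallback to TMPDIR are assumptions made for illustration; only the /mnt/local_disk path comes from this page.

```shell
# Pick a scratch directory: the node-local NVMe mount if it exists and is
# writable, otherwise fall back to a temporary directory (assumption).
# Usage: pick_scratch_dir [local_disk_path]
pick_scratch_dir() (
  local_disk="${1:-/mnt/local_disk}"
  if [ -d "$local_disk" ] && [ -w "$local_disk" ]; then
    echo "$local_disk"
  else
    echo "${TMPDIR:-/tmp}"
  fi
)

SCRATCH_DIR="$(pick_scratch_dir)"
echo "Using scratch directory: $SCRATCH_DIR"
```

Keep in mind that data on the local drive is visible only on that node, unlike files under /home.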

InfiniBand partitioning

Worker nodes are interconnected by a partitioned 400 Gb/s InfiniBand fabric with M_KEY. For this reason, commands like ibhosts will not work, while distributed workloads such as MPI work correctly.

To use InfiniBand and NCCL from inside a Docker container, make sure to set the environment variable NCCL_IB_PKEY=1.

For example:

docker run -e NCCL_IB_PKEY=1
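
Because a missing variable is easy to overlook, a container entrypoint could check it before starting the job. The following is a minimal sketch of such a guard; the function name check_nccl_ib_pkey is hypothetical and not part of NCCL or Docker.

```shell
# Return 0 if NCCL_IB_PKEY is set to 1; otherwise print a hint to stderr
# and return 1.
check_nccl_ib_pkey() {
  if [ "${NCCL_IB_PKEY:-}" = "1" ]; then
    return 0
  fi
  echo "NCCL_IB_PKEY is '${NCCL_IB_PKEY:-unset}'; start the container with -e NCCL_IB_PKEY=1" >&2
  return 1
}

# Example: report the result without aborting the shell.
if check_nccl_ib_pkey; then
  echo "NCCL_IB_PKEY is set correctly"
fi
```

In an entrypoint script you might instead exit when the check fails, so that a misconfigured container stops before the distributed job starts.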

Troubleshooting

Some versions of git have problems creating temporary files on shared filesystems. As a workaround, set the following environment variable:

export GIT_OBJECT_DIRECTORY=/tmp/git-objects

Other

Worker nodes use the jump host as their default gateway, NAT firewall, and Slurm controller.

CUDA, OpenMPI, doca-ofed, and nvidia-drivers are installed on each server.
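
A quick way to see whether the expected tools are reachable from your shell is to look them up on PATH. The sketch below does that for a list of command names. The function name report_tools is hypothetical, and the mapping from packages to commands (nvidia-smi, nvcc, mpirun, ofed_info) is an assumption about what each package typically provides.

```shell
# Report, for each given command name, whether it is found on PATH.
# Usage: report_tools <command>...
report_tools() (
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool: found"
    else
      echo "$tool: missing"
    fi
  done
)

# Assumed command names: nvidia-smi (nvidia-drivers), nvcc (CUDA),
# mpirun (OpenMPI), ofed_info (doca-ofed).
report_tools nvidia-smi nvcc mpirun ofed_info
```

A "missing" result only means the command is not on PATH in the current shell; the package may still be installed in a location that is not on PATH.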

A PyTorch installer setup script is available at /home/pytorch.setup.sh.

Last updated 1 month ago
