Monitoring
Our instant clusters come with dashboards and customizable alerts for monitoring the state of the cluster.
To access the dashboards, open the cluster dropdown and select View metric dashboard.

From here, follow the instructions to get the address and login details for the Grafana portal.
Once logged in, click Dashboards in the side tab, where you will see four dashboards:
GPU Overview: A general GPU monitoring dashboard.
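NVIDIA DCGM: A dashboard of the GPU metrics reported by NVIDIA DCGM (Data Center GPU Manager).
Node Exporter Full: A dashboard of host-level metrics (CPU, memory, disk, network) collected by the Prometheus node exporter.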
SLURM Dashboard: A dashboard monitoring all SLURM job information.
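If you prefer to check which dashboards are available from a script rather than the web UI, Grafana exposes a dashboard search endpoint (`/api/search`) in its HTTP API. The following is a minimal sketch, not an official client: the base URL and token are placeholders, it assumes the Grafana portal is reachable from where the script runs, and it assumes you have created a service account token (the login details described above may be a username and password instead).

```python
import json
import urllib.parse
import urllib.request


def build_dashboard_search_request(base_url: str, token: str) -> urllib.request.Request:
    """Build a GET request for Grafana's dashboard search endpoint.

    base_url and token are placeholders: use the Grafana address from the
    cluster page and a token you created yourself (assumption).
    """
    query = urllib.parse.urlencode({"type": "dash-db"})  # dashboards only, no folders
    url = f"{base_url.rstrip('/')}/api/search?{query}"
    headers = {"Authorization": f"Bearer {token}", "Accept": "application/json"}
    return urllib.request.Request(url, headers=headers)


def dashboard_titles(search_results: list[dict]) -> list[str]:
    """Extract sorted dashboard titles from a parsed /api/search response."""
    return sorted(
        item["title"] for item in search_results if item.get("type") == "dash-db"
    )


def list_dashboards(base_url: str, token: str) -> list[str]:
    """Query Grafana and return the titles of all dashboards it reports."""
    request = build_dashboard_search_request(base_url, token)
    with urllib.request.urlopen(request, timeout=10) as response:
        return dashboard_titles(json.load(response))
```

Calling `list_dashboards("https://<your-grafana-address>", "<your-token>")` should return the dashboard titles, which on a fresh cluster would be expected to include the four listed above.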
The cluster also comes with a set of pre-installed alerts, which are listed on the alert tab. You can add more alerts and configure them to send notifications, for example via a Slack plugin.
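Grafana's own Slack integration delivers alert notifications for you, so no code is required. To illustrate what such a notification amounts to, here is a rough sketch of posting a message to a Slack incoming webhook, which accepts a JSON body with a `text` field. The function names, the message layout, and the webhook URL are all hypothetical and are not part of the cluster setup.

```python
import json
import urllib.request


def build_slack_message(alert_name: str, state: str, summary: str) -> dict:
    """Format an alert as a Slack incoming-webhook payload.

    The "[STATE] name: summary" layout is an illustrative choice, not the
    format Grafana itself produces.
    """
    return {"text": f"[{state.upper()}] {alert_name}: {summary}"}


def build_slack_request(webhook_url: str, message: dict) -> urllib.request.Request:
    """Build the POST request that delivers the message to Slack."""
    body = json.dumps(message).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_slack_message(webhook_url: str, message: dict) -> int:
    """Send the message and return the HTTP status code."""
    request = build_slack_request(webhook_url, message)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

In practice you would configure the webhook URL once in Grafana's notification settings rather than sending messages yourself; the sketch is mainly useful for testing that a webhook works before wiring it to an alert.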
